Abstract

Many AI researchers and cognitive scientists have argued that analogy is the core 
of cognition. The most influential work on computational modeling of analogy-making is
Structure Mapping Theory (SMT) and its implementation in the Structure Mapping Engine 
(SME). A limitation of SME is the requirement for complex hand-coded representations. 
We introduce the Latent Relation Mapping Engine (LRME), which combines ideas from 
SME and Latent Relational Analysis (LRA) in order to remove the requirement for hand- 
coded representations. LRME builds analogical mappings between lists of words, using a 
large corpus of raw text to automatically discover the semantic relations among the words. 
We evaluate LRME on a set of twenty analogical mapping problems, ten based on scientific 
analogies and ten based on common metaphors. LRME achieves human-level performance 
on the twenty problems. We compare LRME with a variety of alternative approaches and 
find that they are not able to reach the same level of performance. 

1. Introduction 

When we are faced with a problem, we try to recall similar problems that we have faced 
in the past, so that we can transfer our knowledge from past experience to the current 
problem. We make an analogy between the past situation and the current situation, and we 
use the analogy to transfer knowledge (Gentner, 1983; Minsky, 1986; Holyoak & Thagard, 
1995; Hofstadter, 2001; Hawkins & Blakeslee, 2004). 

In his survey of the computational modeling of analogy-making, French (2002) cites 
Structure Mapping Theory (SMT) (Gentner, 1983) and its implementation in the Structure 
Mapping Engine (SME) (Falkenhainer, Forbus, & Gentner, 1989) as the most influential
work on modeling of analogy-making. In SME, an analogical mapping $M : A \to B$ is from
a source A to a target B. The source is more familiar, more known, or more concrete, 
whereas the target is relatively unfamiliar, unknown, or abstract. The analogical mapping 
is used to transfer knowledge from the source to the target. 

Gentner (1983) argues that there are two kinds of similarity, attributional similarity 
and relational similarity. The distinction between attributes and relations may be
understood in terms of predicate logic. An attribute is a predicate with one argument, such as
large(X), meaning X is large. A relation is a predicate with two or more arguments, such
as collides_with(X, Y), meaning X collides with Y.

The Structure Mapping Engine prefers mappings based on relational similarity over 
mappings based on attributional similarity (Falkenhainer et al., 1989). For example, SME
is able to build a mapping from a representation of the solar system (the source) to a
representation of the Rutherford-Bohr model of the atom (the target). The sun is mapped
to the nucleus, planets are mapped to electrons, and mass is mapped to charge. Note that 
this mapping emphasizes relational similarity. The sun and the nucleus are very different 
in terms of their attributes: the sun is very large and the nucleus is very small. Likewise, 
planets and electrons have little attributional similarity. On the other hand, planets revolve 
around the sun like electrons revolve around the nucleus. The mass of the sun attracts the 
mass of the planets like the charge of the nucleus attracts the charge of the electrons. 

Gentner (1991) provides evidence that children rely primarily on attributional similarity 
for mapping, gradually switching over to relational similarity as they mature. She uses the 
terms mere appearance to refer to mapping based mostly on attributional similarity, analogy 
to refer to mapping based mostly on relational similarity, and literal similarity to refer to a 
mixture of attributional and relational similarity. Since we use analogical mappings to solve 
problems and make predictions, we should focus on structure, especially causal relations, 
and look beyond the surface attributes of things (Gentner, 1983). The analogy between 
the solar system and the Rutherford-Bohr model of the atom illustrates the importance of 
going beyond mere appearance, to the underlying structures. 

Figures 1 and 2 show the LISP representations used by SME as input for the analogy 
between the solar system and the atom (Falkenhainer et al., 1989). Chalmers, French, 
and Hofstadter (1992) criticize SME's requirement for complex hand-coded representations. 
They argue that most of the hard work is done by the human who creates these high-level 
hand-coded representations, rather than by SME. 

(defEntity sun :type inanimate)
(defEntity planet :type inanimate)

(defDescription solar-system
  entities (sun planet)
  expressions (((mass sun) :name mass-sun)
               ((mass planet) :name mass-planet)
               ((greater mass-sun mass-planet) :name >mass)
               ((attracts sun planet) :name attracts-form)
               ((revolve-around planet sun) :name revolve)
               ((and >mass attracts-form) :name and1)
               ((cause and1 revolve) :name cause-revolve)
               ((temperature sun) :name temp-sun)
               ((temperature planet) :name temp-planet)
               ((greater temp-sun temp-planet) :name >temp)
               ((gravity mass-sun mass-planet) :name force-gravity)
               ((cause force-gravity attracts-form) :name why-attracts)))

Figure 1: The representation of the solar system in SME (Falkenhainer et al., 1989). 

Gentner, Forbus, and their colleagues have attempted to avoid hand-coding in their 
recent work with SME.¹ The CogSketch system can generate LISP representations from
simple sketches (Forbus, Usher, Lovett, Lockwood, & Wetzel, 2008). The Gizmo system 
can generate LISP representations from qualitative physics models (Yan & Forbus, 2005). 
The Learning Reader system can generate LISP representations from natural language text 
(Forbus et al., 2007). These systems do not require LISP input. 

1. Dedre Gentner, personal communication, October 29, 2008. 
(defEntity nucleus :type inanimate)
(defEntity electron :type inanimate)

(defDescription rutherford-atom
  entities (nucleus electron)
  expressions (((mass nucleus) :name mass-n)
               ((mass electron) :name mass-e)
               ((greater mass-n mass-e) :name >mass)
               ((attracts nucleus electron) :name attracts-form)
               ((revolve-around electron nucleus) :name revolve)
               ((charge electron) :name q-electron)
               ((charge nucleus) :name q-nucleus)
               ((opposite-sign q-nucleus q-electron) :name >charge)
               ((cause >charge attracts-form) :name why-attracts)))

Figure 2: The Rutherford-Bohr model of the atom in SME (Falkenhainer et al., 1989). 

However, the CogSketch user interface requires the person who draws the sketch to
identify the basic components in the sketch and hand-label them with terms from a knowledge
base derived from OpenCyc. Forbus et al. (2008) note that OpenCyc contains more than 
58,000 hand-coded concepts, and they have added further hand-coded concepts to OpenCyc, 
in order to support CogSketch. The Gizmo system requires the user to hand-code a physical 
model, using the methods of qualitative physics (Yan & Forbus, 2005). Learning Reader 
uses more than 28,000 phrasal patterns, which were derived from ResearchCyc (Forbus 
et al., 2007). It is evident that SME still requires substantial hand-coded knowledge. 

The work we present in this paper is an effort to avoid complex hand-coded
representations. Our approach is to combine ideas from SME (Falkenhainer et al., 1989) and Latent
Relational Analysis (LRA) (Turney, 2006). We call the resulting algorithm the Latent
Relation Mapping Engine (LRME). We represent the semantic relation between two terms
using a vector, in which the elements are derived from pattern frequencies in a large corpus 
of raw text. Because the semantic relations are automatically derived from a corpus, LRME 
does not require hand-coded representations of relations. It only needs a list of terms from 
the source and a list of terms from the target. Given these two lists, LRME uses the corpus 
to build representations of the relations among the terms, and then it constructs a mapping 
between the two lists. 

Tables 1 and 2 show the input and output of LRME for the analogy between the solar 
system and the Rutherford-Bohr model of the atom. Although some human effort is involved 
in constructing the input lists, it is considerably less effort than SME requires for its input 
(contrast Figures 1 and 2 with Table 1). 

Scientific analogies, such as the analogy between the solar system and the Rutherford- 
Bohr model of the atom, may seem esoteric, but we believe analogy-making is ubiquitous 
in our daily lives. A potential practical application for this work is the task of identifying 
semantic roles (Gildea & Jurafsky, 2002). Since roles are relations, not attributes, it is 
appropriate to treat semantic role labeling as an analogical mapping problem. 

Source A        Target B
planet          revolves
attracts        atom
revolves        attracts
sun             electromagnetism
gravity         nucleus
solar system    charge
mass            electron

Table 1: The representation of the input in LRME.

Source A        Mapping M    Target B
solar system    →            atom
sun             →            nucleus
planet          →            electron
mass            →            charge
attracts        →            attracts
revolves        →            revolves
gravity         →            electromagnetism

Table 2: The representation of the output in LRME.

For example, the Judgement semantic frame contains semantic roles such as Judge,
Evaluee, and Reason, and the Statement frame contains roles such as Speaker,
Addressee, Message, Topic, and Medium (Gildea & Jurafsky, 2002). The task of identifying
semantic roles is to automatically label sentences with their roles, as in the following
examples (Gildea & Jurafsky, 2002):

• [Judge She] blames [Evaluee the Government] [Reason for failing to do enough to help].

• [Speaker We] talked [Topic about the proposal] [Medium over the phone]. 

If we have a training set of labeled sentences and a testing set of unlabeled sentences, then 
we may view the task of labeling the testing sentences as a problem of creating analogical 
mappings between the training sentences (sources) and the testing sentences (targets).
Table 3 shows how "She blames the Government for failing to do enough to help." might be
mapped to "They blame the company for polluting the environment." Once a mapping has 
been found, we can transfer knowledge, in the form of semantic role labels, from the source 
to the target. 



Source A      Mapping M    Target B
she           →            they
blames        →            blame
government    →            company
failing       →            polluting
help          →            environment



Table 3: Semantic role labeling as analogical mapping. 
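The label transfer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping is the one shown in Table 3, the role of each source word is read off the first labeled example sentence, and the names `mapping`, `source_roles`, and `transfer_roles` are hypothetical and not part of LRME.

```python
# Mapping M from Table 3 (source term -> target term).
mapping = {
    "she": "they",
    "blames": "blame",
    "government": "company",
    "failing": "polluting",
    "help": "environment",
}

# Role labels for source words, read off the labeled example sentence.
# "blames" is the predicate and carries no role label in this sketch.
source_roles = {
    "she": "Judge",
    "government": "Evaluee",
    "failing": "Reason",
    "help": "Reason",
}


def transfer_roles(mapping, source_roles):
    """Copy each source word's role label to the target word it maps to."""
    return {
        mapping[word]: role
        for word, role in source_roles.items()
        if word in mapping
    }


target_roles = transfer_roles(mapping, source_roles)
print(target_roles)
```

Once the mapping is fixed, the transfer itself is a lookup; the hard part, finding the mapping, is the subject of the rest of the paper.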

In Section 2, we briefly discuss the hypotheses behind the design of LRME. We then 
precisely define the task that is performed by LRME, a specific form of analogical mapping, 
in Section 3. LRME builds on Latent Relational Analysis (LRA), hence we summarize LRA
in Section 4. We discuss potential applications of LRME in Section 5. 

To evaluate LRME, we created twenty analogical mapping problems, ten science
analogy problems (Holyoak & Thagard, 1995) and ten common metaphor problems (Lakoff &
Johnson, 1980). Table 1 is one of the science analogy problems. Our intended solution is 
given in Table 2. To validate our intended solutions, we gave our colleagues the lists of 
terms (as in Table 1) and asked them to generate mappings between the lists. Section 6 
presents the results of this experiment. Across the twenty problems, the average agreement 
with our intended solutions (as in Table 2) was 87.6%. 

The LRME algorithm is outlined in Section 7, along with its evaluation on the twenty 
mapping problems. LRME achieves an accuracy of 91.5%. The difference between this 
performance and the human average of 87.6% is not statistically significant. 

Section 8 examines a variety of alternative approaches to the analogy mapping task. The 
best approach achieves an accuracy of 76.8%, but this approach requires hand-coded part- 
of-speech tags. This performance is significantly below LRME and human performance. 

In Section 9, we discuss some questions that are raised by the results in the preceding 
sections. Related work is described in Section 10, future work and limitations are considered 
in Section 11, and we conclude in Section 12. 

2. Guiding Hypotheses 

In this section, we list some of the assumptions that have guided the design of LRME. The
results we present in this paper do not necessarily require these assumptions, but they may
help the reader understand the reasoning behind our approach.

1. Analogies and semantic relations: Analogies are based on semantic relations 
(Gentner, 1983). For example, the analogy between the solar system and the
Rutherford-Bohr model of the atom is based on the similarity of the semantic relations
among the concepts involved in our understanding of the solar system to the semantic 
relations among the concepts involved in the Rutherford-Bohr model of the atom. 

2. Co-occurrences and semantic relations: Two terms have an interesting,
significant semantic relation if and only if they tend to co-occur within a relatively
small window (e.g., five words) in a relatively large corpus (e.g., $10^{10}$ words). Having
an interesting semantic relation causes co-occurrence and co-occurrence is a reliable 
indicator of an interesting semantic relation (Firth, 1957). 

3. Meanings and semantic relations: Meaning has more to do with relations among 
words than individual words. Individual words tend to be ambiguous and polysemous. 
By putting two words into a pair, we constrain their possible meanings. By putting 
words into a sentence, with multiple relations among the words in the sentence, we 
constrain the possible meanings further. If we focus on word pairs (or tuples), instead 
of individual words, word sense disambiguation is less problematic. Perhaps a word 
has no sense apart from its relations with other words (Kilgarriff, 1997). 

4. Pattern distributions and semantic relations: There is a many-to-many
mapping between semantic relations and the patterns in which two terms co-occur. For
example, the relation CauseEffect(X, Y) may be expressed as "X causes Y", "Y
from X", "Y due to X", "Y because of X", and so on. Likewise, the pattern
"Y from X" may be an expression of CauseEffect(X, Y) ("sick from bacteria") or
OriginEntity(X, Y) ("oranges from Spain"). However, for a given X and Y, the
statistical distribution of patterns in which X and Y co-occur is a reliable signature of
the semantic relations between X and Y (Turney, 2006). 

To the extent that LRME works, we believe its success lends some support to these
hypotheses.
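The fourth hypothesis can be illustrated with a toy sketch: count the word patterns that join two terms and compare the resulting count vectors. This is not LRA or LRME. The five sentences are made up, and the function names `pattern_counts` and `cosine`, the `max_gap` parameter, and the use of cosine as the comparison are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def pattern_counts(x, y, sentences, max_gap=3):
    """Count the word sequences that appear between x and y (x first)."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(words):
            if word != x:
                continue
            # Look for y with at most max_gap intervening words after x.
            for j in range(i + 1, min(i + max_gap + 2, len(words))):
                if words[j] == y:
                    pattern = " ".join(["X"] + words[i + 1:j] + ["Y"])
                    counts[pattern] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up sentences standing in for a large corpus.
corpus = [
    "The planet revolves around the sun.",
    "Each planet is attracted by the sun.",
    "The electron revolves around the nucleus.",
    "Each electron is attracted by the nucleus.",
    "The mason cuts the stone.",
]

planet_sun = pattern_counts("planet", "sun", corpus, max_gap=4)
electron_nucleus = pattern_counts("electron", "nucleus", corpus, max_gap=4)
mason_stone = pattern_counts("mason", "stone", corpus, max_gap=4)

print(cosine(planet_sun, electron_nucleus))  # identical pattern distributions
print(cosine(planet_sun, mason_stone))       # no shared patterns
```

In this toy corpus, planet:sun and electron:nucleus occur in exactly the same patterns, while mason:stone shares none of them, which is the sense in which a pattern distribution acts as a signature of a relation.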

3. The Task 

In this paper, we examine algorithms that generate analogical mappings. For simplicity, we 
restrict the task to generating bijective mappings; that is, mappings that are both injective 
(one-to-one; there is no instance in which two terms in the source map to the same term 
in the target) and surjective (onto; the source terms cover all of the target terms; there is 
no target term that is left out of the mapping). We assume that the entities that are to be 
mapped are given as input. Formally, the input $I$ for the algorithms is two sets of terms, $A$
and $B$.

$$I = \{(A, B)\} \tag{1}$$

Since the mappings are bijective, $A$ and $B$ must contain the same number of terms, $m$.

$$A = \{a_1, a_2, \ldots, a_m\} \tag{2}$$
$$B = \{b_1, b_2, \ldots, b_m\} \tag{3}$$

A term, $a_i$ or $b_j$, may consist of a single word (planet) or a compound of two or more words
(solar system). The words may be any part of speech (nouns, verbs, adjectives, or adverbs).
The output $O$ is a bijective mapping $M$ from $A$ to $B$.

$$O = \{M : A \to B\} \tag{4}$$
$$M(a_i) \in B \tag{5}$$
$$M(A) = \{M(a_1), M(a_2), \ldots, M(a_m)\} = B \tag{6}$$

The algorithms that we consider here can accept a batch of multiple independent mapping
problems as input and generate a mapping for each one as output.

$$I = \{(A_1, B_1), (A_2, B_2), \ldots, (A_n, B_n)\} \tag{7}$$
$$O = \{M_1 : A_1 \to B_1, M_2 : A_2 \to B_2, \ldots, M_n : A_n \to B_n\} \tag{8}$$

Suppose the terms in $A$ are in some arbitrary order $\mathbf{a}$.

$$\mathbf{a} = (a_1, a_2, \ldots, a_m) \tag{9}$$

The mapping function $M : A \to B$, given $\mathbf{a}$, determines a unique ordering $\mathbf{b}$ of $B$.

$$\mathbf{b} = (M(a_1), M(a_2), \ldots, M(a_m)) \tag{10}$$

Likewise, an ordering $\mathbf{b}$ of $B$, given $\mathbf{a}$, defines a unique mapping function $M$. Since there
are $m!$ possible orderings of $B$, there are also $m!$ possible mappings from $A$ to $B$. The task
is to search through the $m!$ mappings and find the best one. (Section 6 shows that there is
a relatively high degree of consensus about which mappings are best.)

Let $P(A, B)$ be the set of all $m!$ bijective mappings from $A$ to $B$. ($P$ stands for
permutation, since each mapping corresponds to a permutation.)

$$P(A, B) = \{M_1, M_2, \ldots, M_{m!}\} \tag{11}$$
$$m = |A| = |B| \tag{12}$$
$$m! = |P(A, B)| \tag{13}$$

In the following experiments, $m$ is 7 on average and 9 at most, so $m!$ is usually around
$7! = 5{,}040$ and at most $9! = 362{,}880$. It is feasible for us to exhaustively search $P(A, B)$.
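The correspondence between orderings of $B$ and mappings can be sketched with Python's standard `itertools.permutations`. The function name `all_bijections` and the three-term example lists are hypothetical, chosen only to keep the output small.

```python
import math
from itertools import permutations


def all_bijections(source, target):
    """Yield every bijective mapping from source to target as a dict."""
    if len(source) != len(target):
        raise ValueError("A and B must contain the same number of terms")
    for ordering in permutations(target):
        # Each ordering b of B, paired with the fixed ordering a of A,
        # defines exactly one mapping M.
        yield dict(zip(source, ordering))


A = ["sun", "planet", "mass"]
B = ["nucleus", "electron", "charge"]

mappings = list(all_bijections(A, B))
print(len(mappings), math.factorial(len(A)))  # both are 3! = 6

# Search-space sizes quoted in the text for m = 7 and m = 9.
print(math.factorial(7), math.factorial(9))
```

At these sizes a plain loop over all candidates is practical, which is why no pruning or heuristic search appears in the sketch.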

We explore two basic kinds of algorithms for generating analogical mappings, algorithms
based on attributional similarity and algorithms based on relational similarity (Turney,
2006). The attributional similarity between two words, $\mathrm{sim}_a(a, b) \in \mathbb{R}$, depends on the
degree of correspondence between the properties of $a$ and $b$. The more correspondence
there is, the greater their attributional similarity. The relational similarity between two
pairs of words, $\mathrm{sim}_r(a:b, c:d) \in \mathbb{R}$, depends on the degree of correspondence between the
relations of $a:b$ and $c:d$. The more correspondence there is, the greater their relational
similarity. For example, dog and wolf have a relatively high degree of attributional similarity,
whereas dog:bark and cat:meow have a relatively high degree of relational similarity.

Attributional mapping algorithms seek the mapping (or mappings) $M_a$ that maximizes
the sum of the attributional similarities between the terms in $A$ and the corresponding
terms in $B$. (When there are multiple mappings that maximize the sum, we break the tie
by randomly choosing one of them.)

$$M_a = \operatorname*{argmax}_{M \in P(A,B)} \sum_{i=1}^{m} \mathrm{sim}_a(a_i, M(a_i)) \tag{14}$$

Relational mapping algorithms seek the mapping (or mappings) $M_r$ that maximizes the
sum of the relational similarities.

$$M_r = \operatorname*{argmax}_{M \in P(A,B)} \sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}_r(a_i : a_j, M(a_i) : M(a_j)) \tag{15}$$

In (15), we assume that $\mathrm{sim}_r$ is symmetrical. For example, the degree of relational similarity
between dog:bark and cat:meow is the same as the degree of relational similarity between
bark:dog and meow:cat.

$$\mathrm{sim}_r(a:b, c:d) = \mathrm{sim}_r(b:a, d:c) \tag{16}$$

We also assume that $\mathrm{sim}_r(a:a, b:b)$ is not interesting; for example, it may be some constant
value for all $a$ and $b$. Therefore (15) is designed so that $i$ is always less than $j$.
Let $\mathrm{score}_r(M)$ and $\mathrm{score}_a(M)$ be defined as follows.

$$\mathrm{score}_r(M) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}_r(a_i : a_j, M(a_i) : M(a_j)) \tag{17}$$
$$\mathrm{score}_a(M) = \sum_{i=1}^{m} \mathrm{sim}_a(a_i, M(a_i)) \tag{18}$$

Now $M_r$ and $M_a$ may be defined in terms of $\mathrm{score}_r(M)$ and $\mathrm{score}_a(M)$.

$$M_r = \operatorname*{argmax}_{M \in P(A,B)} \mathrm{score}_r(M) \tag{19}$$
$$M_a = \operatorname*{argmax}_{M \in P(A,B)} \mathrm{score}_a(M) \tag{20}$$

$M_r$ is the best mapping according to $\mathrm{sim}_r$ and $M_a$ is the best mapping according to $\mathrm{sim}_a$.
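The two scores and the exhaustive search over $P(A, B)$ can be sketched as follows. The similarity functions here are toy stand-ins: the relation labels in `RELATION` and the numbers in `ATTRIBUTE_SIM` are invented for this example and are not derived from any corpus, and all function names are hypothetical. Unlike the random tie-breaking described in the text, `max` simply keeps the first best mapping it meets.

```python
from itertools import permutations


def score_r(mapping, source, sim_r):
    """Relational score: sum sim_r over all pairs with i < j, as in (17)."""
    total = 0.0
    for i in range(len(source)):
        for j in range(i + 1, len(source)):
            a_i, a_j = source[i], source[j]
            total += sim_r((a_i, a_j), (mapping[a_i], mapping[a_j]))
    return total


def score_a(mapping, source, sim_a):
    """Attributional score: sum sim_a over all source terms, as in (18)."""
    return sum(sim_a(a, mapping[a]) for a in source)


def best_mapping(source, target, score):
    """Search all of P(A, B); ties go to the first mapping encountered."""
    candidates = (dict(zip(source, p)) for p in permutations(target))
    return max(candidates, key=score)


# Invented relation labels for ordered pairs of terms.
RELATION = {
    ("sun", "planet"): "center-orbiter",
    ("nucleus", "electron"): "center-orbiter",
    ("sun", "mass"): "has-property",
    ("nucleus", "charge"): "has-property",
    ("planet", "mass"): "has-property",
    ("electron", "charge"): "has-property",
}

# Invented attributional similarities between source and target terms.
ATTRIBUTE_SIM = {
    ("sun", "charge"): 0.4, ("sun", "nucleus"): 0.2, ("sun", "electron"): 0.1,
    ("planet", "nucleus"): 0.3, ("planet", "electron"): 0.2,
    ("planet", "charge"): 0.1, ("mass", "charge"): 0.3,
    ("mass", "electron"): 0.3, ("mass", "nucleus"): 0.2,
}


def relation(pair):
    """Label of an ordered pair; a reversed pair gets an 'inverse-' label."""
    x, y = pair
    if (x, y) in RELATION:
        return RELATION[(x, y)]
    if (y, x) in RELATION:
        return "inverse-" + RELATION[(y, x)]
    return None


def toy_sim_r(pair1, pair2):
    """1.0 when both pairs carry the same label; symmetric as in (16)."""
    r1, r2 = relation(pair1), relation(pair2)
    return 1.0 if r1 is not None and r1 == r2 else 0.0


def toy_sim_a(a, b):
    return ATTRIBUTE_SIM.get((a, b), 0.0)


A = ["sun", "planet", "mass"]
B = ["charge", "electron", "nucleus"]

M_r = best_mapping(A, B, lambda M: score_r(M, A, toy_sim_r))
M_a = best_mapping(A, B, lambda M: score_a(M, A, toy_sim_a))
print("relational:   ", M_r)
print("attributional:", M_a)
```

With these invented numbers the two criteria disagree: the relational search maps sun to nucleus, while the attributional search maps sun to charge. The example is constructed to show that the two objectives can pick different mappings, not to reproduce any result from the paper.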

Recall Gentner's (1991) terms, discussed in Section 1, mere appearance (mostly
attributional similarity), analogy (mostly relational similarity), and literal similarity (a mixture of
attributional and relational similarity). We take it that $M_r$ is an abstract model of
mapping based on analogy and $M_a$ is a model of mere appearance. For literal similarity, we can
combine $M_r$ and $M_a$, but we should take care to normalize $\mathrm{score}_r(M)$ and $\mathrm{score}_a(M)$ before
we combine them. (We experiment with combining them in Section 9.2.)
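One reason normalization matters is that the two scores sum different numbers of terms. The block below counts the terms and then shows one possible normalization, dividing each score by its number of terms. That particular choice, the barred symbols, the combined score, and the weight $\lambda$ are assumptions made here for illustration; they are not taken from Section 9.2.

```latex
% Number of similarity terms summed by each score, from (17) and (18).
\[
  \#\,\text{terms in } \mathrm{score}_r(M) = \binom{m}{2} = \frac{m(m-1)}{2},
  \qquad
  \#\,\text{terms in } \mathrm{score}_a(M) = m .
\]
% For m = 7 this gives 21 relational terms versus 7 attributional terms.
% Hypothetical per-term normalization (an assumption, not the paper's method):
\[
  \overline{\mathrm{score}}_r(M) = \frac{2}{m(m-1)}\,\mathrm{score}_r(M),
  \qquad
  \overline{\mathrm{score}}_a(M) = \frac{1}{m}\,\mathrm{score}_a(M),
\]
\[
  \mathrm{score}_\lambda(M) = \lambda\,\overline{\mathrm{score}}_r(M)
    + (1-\lambda)\,\overline{\mathrm{score}}_a(M),
  \qquad 0 \le \lambda \le 1 .
\]
```

Without some such rescaling, the relational sum would tend to dominate the combination simply because it has more terms.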

4. Latent Relational Analysis 

LRME uses a simplified form of Latent Relational Analysis (LRA) (Turney, 2005, 2006) 
to calculate the relational similarity between pairs of words. We will briefly describe past 
work with LRA before we present LRME. 

LRA takes as input $I$ a set of word pairs and generates as output $O$ the relational
similarity $\mathrm{sim}_r(a_i : b_i, a_j : b_j)$ between any two pairs in the input.

$$I = \{a_1 : b_1, a_2 : b_2, \ldots, a_n : b_n\} \tag{21}$$
$$O = \{\mathrm{sim}_r : I \times I \to \mathbb{R}\} \tag{22}$$

LRA was designed to evaluate proportional analogies. Proportional analogies have the form
$a:b::c:d$, which means "a is to b as c is to d". For example, mason:stone::carpenter:wood
means "mason is to stone as carpenter is to wood". A mason is an artisan who works with
stone and a carpenter is an artisan who works with wood.

We consider proportional analogies to be a special case of bijective analogical mapping,
as defined in Section 3, in which $|A| = |B| = m = 2$. For example, $a_1 : a_2 :: b_1 : b_2$ is equivalent
to $M_0$ in (23).

$$A = \{a_1, a_2\}, \quad B = \{b_1, b_2\}, \quad M_0(a_1) = b_1, \quad M_0(a_2) = b_2 \tag{23}$$

From the definition of $\mathrm{score}_r(M)$ in (17), we have the following result for $M_0$.

$$\mathrm{score}_r(M_0) = \mathrm{sim}_r(a_1 : a_2, M_0(a_1) : M_0(a_2)) = \mathrm{sim}_r(a_1 : a_2, b_1 : b_2) \tag{24}$$

That is, the quality of the proportional analogy mason:stone::carpenter:wood is given by
$\mathrm{sim}_r(\text{mason} : \text{stone}, \text{carpenter} : \text{wood})$.

Proportional analogies may also be evaluated using attributional similarity. From the
definition of $\mathrm{score}_a(M)$ in (18), we have the following result for $M_0$.

$$\mathrm{score}_a(M_0) = \mathrm{sim}_a(a_1, M_0(a_1)) + \mathrm{sim}_a(a_2, M_0(a_2)) = \mathrm{sim}_a(a_1, b_1) + \mathrm{sim}_a(a_2, b_2) \tag{25}$$

For attributional similarity, the quality of the proportional analogy mason:stone::carpenter:wood
is given by $\mathrm{sim}_a(\text{mason}, \text{carpenter}) + \mathrm{sim}_a(\text{stone}, \text{wood})$.

LRA only handles proportional analogies. The main contribution of LRME is to extend 
LRA beyond proportional analogies to bijective analogies for which m > 2. 

Turney (2006) describes ten potential applications of LRA: recognizing proportional 
analogies, structure mapping theory, modeling metaphor, classifying semantic relations, 
word sense disambiguation, information extraction, question answering, automatic
thesaurus generation, information retrieval, and identifying semantic roles. Two of these
applications (evaluating proportional analogies and classifying semantic relations) are
experimentally evaluated, with state-of-the-art results.

Turney (2006) compares the performance of relational similarity (24) and attributional 
similarity (25) on the task of solving 374 multiple-choice proportional analogy questions from 
the SAT college entrance test. LRA is used to measure relational similarity and a variety 
of lexicon-based and corpus-based algorithms are used to measure attributional similarity. 
LRA achieves an accuracy of 56% on the 374 SAT questions, which is not significantly 
different from the average human score of 57%. On the other hand, the best performance 
by attributional similarity is 35%. The results show that attributional similarity is better 
than random guessing, but not as good as relational similarity. This result is consistent 
with Gentner's (1991) theory of the maturation of human similarity judgments. 

Turney (2006) also applies LRA to the task of classifying semantic relations in noun- 
modifier expressions. A noun-modifier expression is a phrase, such as laser printer, in which 
the head noun (printer) is preceded by a modifier (laser). The task is to identify the semantic 
relation between the noun and the modifier. In this case, the relation is instrument; the 
laser is an instrument used by the printer. On a set of 600 hand-labeled noun-modifier pairs 
with five different classes of semantic relations, LRA attains 58% accuracy. 

Turney (2008) employs a variation of LRA for solving four different language tests, 
achieving 52% accuracy on SAT analogy questions, 76% accuracy on TOEFL synonym 
questions, 75% accuracy on the task of distinguishing synonyms from antonyms, and 77% 
accuracy on the task of distinguishing words that are similar, words that are associated, 
and words that are both similar and associated. The same core algorithm is used for all 
four tests, with no tuning of the parameters to the particular test. 

5. Applications for LRME 

Since LRME is an extension of LRA, every potential application of LRA is also a potential 
application of LRME. The advantage of LRME over LRA is the ability to handle bijective 
analogies when $m > 2$ (where $m = |A| = |B|$). In this section, we consider the kinds of
applications that might benefit from this ability. 

In Section 7.2, we evaluate LRME on science analogies and common metaphors, which 
supports the claim that these two applications benefit from the ability to handle larger sets 
of terms. In Section 1, we saw that identifying semantic roles (Gildea & Jurafsky, 2002) 
also involves more than two terms, and we believe that LRME will be superior to LRA for 
semantic role labeling. 

Semantic relation classification usually assumes that the relations are binary; that is, 
a semantic relation is a connection between two terms (Rosario & Hearst, 2001; Nastase 
& Szpakowicz, 2003; Turney, 2006; Girju et al., 2007). Yuret observed that binary
relations may be linked by underlying n-ary relations.² For example, Nastase and Szpakowicz
(2003) defined a taxonomy of 30 binary semantic relations. Table 4 shows how six
binary relations from Nastase and Szpakowicz (2003) can be covered by one 5-ary relation,
Agent:Tool:Action:Affected:Theme. An Agent uses a Tool to perform an Action. Somebody 
or something is Affected by the Action. The whole event can be summarized by its Theme. 



Nastase and Szpakowicz (2003)
Relation           Example             Agent:Tool:Action:Affected:Theme
agent              student protest     Agent:Action
purpose            concert hall        Theme:Tool
beneficiary        student discount    Affected:Action
instrument         laser printer       Tool:Agent
object             metal separator     Affected:Tool
object property    sunken ship         Action:Affected


Table 4: How six binary semantic relations from Nastase and Szpakowicz (2003) can be 
viewed as different fragments of one 5-ary semantic relation. 
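The idea that each binary relation is a two-role fragment of the 5-ary relation can be sketched as a projection from a tuple. The `FRAGMENTS` table follows Table 4; the `Event` class, the `binary_fragment` function, and every field of the example event other than the laser printer pair are hypothetical and invented for illustration.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One instance of the 5-ary relation Agent:Tool:Action:Affected:Theme."""
    agent: str
    tool: str
    action: str
    affected: str
    theme: str


# Binary relation -> the two roles it connects, as listed in Table 4.
FRAGMENTS = {
    "agent": ("agent", "action"),
    "purpose": ("theme", "tool"),
    "beneficiary": ("affected", "action"),
    "instrument": ("tool", "agent"),
    "object": ("affected", "tool"),
    "object property": ("action", "affected"),
}


def binary_fragment(event, relation):
    """Project the 5-ary event onto the two roles of a binary relation."""
    first, second = FRAGMENTS[relation]
    return getattr(event, first), getattr(event, second)


# A made-up event built around the "laser printer" example.
event = Event(agent="printer", tool="laser", action="print",
              affected="page", theme="printing")

print(binary_fragment(event, "instrument"))  # the Tool:Agent fragment
print(binary_fragment(event, "agent"))       # the Agent:Action fragment
```

Representing the full tuple once and deriving the binary views from it is one way to keep the six relations consistent with each other.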

In SemEval Task 4, we found it easier to manually tag the datasets when we expanded 
binary relations to their underlying n-ary relations (Girju et al., 2007). We believe that this 
expansion would also facilitate automatic classification of semantic relations. The results 
in Section 9.3 suggest that all of the applications for LRA that we discussed in Section 4 
might benefit from being able to handle bijective analogies when m > 2. 

6. The Mapping Problems 

To evaluate our algorithms for analogical mapping, we created twenty mapping problems, 
given in Appendix A. The twenty problems consist of ten science analogy problems, based 
on examples of analogy in science from Chapter 8 of Holyoak and Thagard (1995), and ten 
common metaphor problems, derived from Lakoff and Johnson (1980). 

The tables in Appendix A show our intended mappings for each of the twenty
problems. To validate these mappings, we invited our colleagues in the Institute for Information
Technology to participate in an experiment. The experiment was hosted on a web server
(only accessible inside our institute) and people participated anonymously, using their web
browsers in their offices. There were 39 volunteers who began the experiment and 22 who
went all the way to the end. In our analysis, we use only the data from the 22 participants
who completed all of the mapping problems.

2. Deniz Yuret, personal communication, February 13, 2007. This observation was in the context of our
work on building the datasets for SemEval 2007 Task 4 (Girju et al., 2007).

The instructions for the participants are in Appendix A. The sequence of the problems 
and the order of the terms within a problem were randomized separately for each participant, 
to remove any effects due to order. Table 5 shows the agreement between our intended 
mapping and the mappings generated by the participants. Across the twenty problems, 
the average agreement was 87.6%, which is higher than the agreement figures for many 
linguistic annotation tasks. This agreement is impressive, given that the participants had 
minimal instructions and no training. 



Type               Mapping  Source → Target                                Agreement (%)    m
science analogies  A1       solar system → atom                                     90.9    7
                   A2       water flow → heat transfer                              86.9    8
                   A3       waves → sounds                                          81.8    8
                   A4       combustion → respiration                                79.0    8
                   A5       sound → light                                           79.2    7
                   A6       projectile → planet                                     97.4    7
                   A7       artificial selection → natural selection                74.7    7
                   A8       billiard balls → gas molecules                          88.1    8
                   A9       computer → mind                                         84.3    9
                   A10      slot machine → bacterial mutation                       83.6    5
common metaphors   M1       war → argument                                          93.5    7
                   M2       buying an item → accepting a belief                     96.1    7
                   M3       grounds for a building → reasons for a theory           87.9    6
                   M4       impediments to travel → difficulties                   100.0    7
                   M5       money → time                                            77.3    6
                   M6       seeds → ideas                                           89.0    7
                   M7       machine → mind                                          98.7    7
                   M8       object → idea                                           89.1    5
                   M9       following → understanding                               96.6    8
                   M10      seeing → understanding                                  78.8    6
Average                                                                             87.6  7.0



Table 5: The average agreement between our intended mappings and the mappings of the 
22 participants. See Appendix A for the details. 



The column labeled m gives the number of terms in the set of source terms for each 
mapping problem (which is equal to the number of terms in the set of target terms). For the 
average problem, m = 7. The third column in Table 5 gives a mnemonic that summarizes 
the mapping (e.g., solar system → atom). Note that the mnemonic is not used as input for
any of the algorithms, nor was the mnemonic shown to the participants in the experiment. 

The agreement figures in Table 5 for each individual mapping problem are averages over 
the m mappings for each problem. Appendix A gives a more detailed view, showing the 
agreement for each individual mapping in the m mappings. The twenty problems contain 
a total of 140 individual mappings (20 × 7). Appendix A shows that every one of these 140
mappings has an agreement of 50% or higher. That is, in every case, the majority of the
participants agreed with our intended mapping. (There are two cases where the agreement 
is exactly 50%. See problems A5 in Table 14 and M5 in Table 16 in Appendix A.) 
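The summary figures quoted for Table 5 can be rechecked with a short script. The rows are copied from Table 5; the variable names are hypothetical, and the script only recomputes the averages and the total, it does not reproduce the per-mapping agreement in Appendix A.

```python
# (problem, agreement in percent, m) copied from Table 5.
TABLE_5 = [
    ("A1", 90.9, 7), ("A2", 86.9, 8), ("A3", 81.8, 8), ("A4", 79.0, 8),
    ("A5", 79.2, 7), ("A6", 97.4, 7), ("A7", 74.7, 7), ("A8", 88.1, 8),
    ("A9", 84.3, 9), ("A10", 83.6, 5),
    ("M1", 93.5, 7), ("M2", 96.1, 7), ("M3", 87.9, 6), ("M4", 100.0, 7),
    ("M5", 77.3, 6), ("M6", 89.0, 7), ("M7", 98.7, 7), ("M8", 89.1, 5),
    ("M9", 96.6, 8), ("M10", 78.8, 6),
]

agreements = [row[1] for row in TABLE_5]
sizes = [row[2] for row in TABLE_5]

mean_agreement = sum(agreements) / len(agreements)  # about 87.6
total_mappings = sum(sizes)                          # individual mappings
mean_size = total_mappings / len(sizes)              # average m

print(mean_agreement, total_mappings, mean_size)
```

The total of 140 comes from summing the actual values of m across the twenty problems, which happens to equal 20 times the average of 7.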

If we select the mapping that is chosen by the majority of the 22 participants, then we 
will get a perfect score on all twenty problems. More precisely, if we try all $m!$ mappings for
each problem, and select the mapping that maximizes the sum of the number of participants 
who agree with each individual mapping in the m mappings, then we will have a score of 
100% on all twenty problems. This is strong support for the intended mappings that are 
given in Appendix A. 
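The consensus procedure just described can be sketched as follows. The three participant responses are invented for illustration and are not the experimental data; the function names `consensus_mapping` and `agreement` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def consensus_mapping(source, target, participant_mappings):
    """Return the bijection that maximizes the total number of votes."""
    votes = Counter()
    for mapping in participant_mappings:
        for a, b in mapping.items():
            votes[(a, b)] += 1

    def total_votes(mapping):
        # Number of participants agreeing with each individual mapping, summed.
        return sum(votes[(a, b)] for a, b in mapping.items())

    candidates = (dict(zip(source, p)) for p in permutations(target))
    return max(candidates, key=total_votes)


def agreement(mapping, intended):
    """Fraction of source terms mapped as in the intended mapping."""
    hits = sum(1 for a in intended if mapping.get(a) == intended[a])
    return hits / len(intended)


# Invented responses from three hypothetical participants.
intended = {"sun": "nucleus", "planet": "electron", "mass": "charge"}
responses = [
    {"sun": "nucleus", "planet": "electron", "mass": "charge"},
    {"sun": "nucleus", "planet": "charge", "mass": "electron"},
    {"sun": "electron", "planet": "nucleus", "mass": "charge"},
]

consensus = consensus_mapping(list(intended), list(intended.values()), responses)
print(consensus)
print(agreement(consensus, intended))
print([agreement(r, intended) for r in responses])
```

In this toy case two of the three participants each agree with only one third of the intended mapping, yet the vote-maximizing bijection still recovers it in full, which mirrors how a consensus can score higher than the average individual.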

In Section 3, we applied Gentner's (1991) categories - mere appearance (mostly
attributional similarity), analogy (mostly relational similarity), and literal similarity (a mixture
of attributional and relational similarity) - to the mappings $M_r$ and $M_a$, where $M_r$ is the
best mapping according to $\mathrm{sim}_r$ and $M_a$ is the best mapping according to $\mathrm{sim}_a$. The twenty
mapping problems were chosen as analogy problems; that is, the intended mappings in
Appendix A are meant to be relational mappings, $M_r$; mappings that maximize relational
similarity, $\mathrm{sim}_r$. We have tried to avoid mere appearance and literal similarity.

In Section 7 we use the twenty mapping problems to evaluate a relational mapping 
algorithm (LRME), and in Section 8 we use them to evaluate several different attributional 
mapping algorithms. Our hypothesis is that LRME will perform significantly better than 
any of the attributional mapping algorithms on the twenty mapping problems, because they 
are analogy problems (not mere appearance problems and not literal similarity problems). 
We expect relational and attributional mapping algorithms would perform approximately 
equally well on literal similarity problems, and we expect that mere appearance problems 
would favour attributional algorithms over relational algorithms, but we do not test these 
latter two hypotheses, because our primary interest in this paper is analogy-making. 

Our goal is to test the hypothesis that there is a real, practical, effective, measurable 
difference between the output of LRME and the output of the various attributional mapping
algorithms. A skeptic might claim that relational similarity $\mathrm{sim}_r(a:b, c:d)$ can be
reduced to attributional similarity $\mathrm{sim}_a(a, c) + \mathrm{sim}_a(b, d)$; therefore our relational mapping
algorithm is a complicated solution to an illusory problem. A slightly less skeptical claim 
is that relational similarity versus attributional similarity is a valid distinction in cognitive 
psychology, but our relational mapping algorithm does not capture this distinction. To test 
our hypothesis and refute these skeptical claims, we have created twenty analogical mapping 
problems, and we will show that LRME handles these problems significantly better than 
the various attributional mapping algorithms. 

7. The Latent Relation Mapping Engine 

The Latent Relation Mapping Engine (LRME) seeks the mapping $M_r$ that maximizes the
sum of the relational similarities.

$$M_r = \operatorname*{argmax}_{M \in P(A,B)} \sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}_r(a_i : a_j, M(a_i) : M(a_j)) \tag{26}$$

We search for $M_r$ by exhaustively evaluating all of the possibilities. Ties are broken
randomly. We use a simplified form of LRA (Turney, 2006) to calculate $\mathrm{sim}_r$.

7.1 Algorithm

Briefly, the idea of LRME is to build a pair-pattern matrix $X$, in which the rows correspond
to pairs of terms and the columns correspond to patterns. For example, the row $x_{i:}$ might
correspond to the pair of terms sun : solar system and the column $x_{:j}$ might correspond to
the pattern "* X centered Y *". In these patterns, "*" is a wild card, which can match
any single word. The value of an element $x_{ij}$ in $X$ is based on the frequency of the pattern
for $x_{:j}$, when X and Y are instantiated by the terms in the pair for $x_{i:}$. For example, if we
take the pattern "* X centered Y *" and instantiate X : Y with the pair sun : solar system,
then we have the pattern "* sun centered solar system *", and thus the value of the element
$x_{ij}$ is based on the frequency of "* sun centered solar system *" in the corpus. The matrix
$X$ is smoothed with a truncated singular value decomposition (SVD) (Golub & Van Loan,
1996) and the relational similarity $\mathrm{sim}_r$ between two pairs of terms is given by the cosine of
the angle between the two corresponding row vectors in $X$.

In more detail, LRME takes as input $I$ a set of mapping problems and generates as
output $O$ a corresponding set of mappings.

$$I = \{(A_1, B_1), (A_2, B_2), \ldots, (A_n, B_n)\} \tag{27}$$

$$O = \{M_1 : A_1 \to B_1, M_2 : A_2 \to B_2, \ldots, M_n : A_n \to B_n\} \tag{28}$$

In the following experiments, all twenty mapping problems (Appendix A) are processed in
one batch ($n = 20$).

The first step is to make a list $R$ that contains all pairs of terms in the input $I$. For
each mapping problem $(A, B)$ in $I$, we add to $R$ all pairs $a_i : a_j$, such that $a_i$ and $a_j$ are
members of $A$, $i \neq j$, and all pairs $b_i : b_j$, such that $b_i$ and $b_j$ are members of $B$, $i \neq j$.
If $|A| = |B| = m$, then there are $m(m-1)$ pairs from $A$ and $m(m-1)$ pairs from $B$.³ A
typical pair in $R$ would be sun : solar system. We do not allow duplicates in $R$; $R$ is a list
of pair types, not pair tokens. For our twenty mapping problems, $R$ is a list of 1,694 pairs.
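
As a minimal sketch of this first step, the list of pair types can be built with ordered 2-permutations of each term list. The function name and the toy problems below are illustrative assumptions, not part of the original implementation.

```python
from itertools import permutations

def build_pair_list(problems):
    """Collect all ordered pairs of distinct terms, without duplicates.

    problems is a list of (source_terms, target_terms) tuples. Both orders
    of every pair are kept, giving m(m - 1) pairs per list of m terms.
    """
    pairs, seen = [], set()
    for source, target in problems:
        for terms in (source, target):
            for pair in permutations(terms, 2):  # ordered pairs with i != j
                if pair not in seen:             # pair types, not pair tokens
                    seen.add(pair)
                    pairs.append(pair)
    return pairs

# Toy input: the second problem reuses "sun" and "planet", so its
# source pairs are duplicates and are not added again.
problems = [
    (["sun", "planet", "gravity"], ["nucleus", "electron", "charge"]),
    (["sun", "planet"], ["king", "subject"]),
]
R = build_pair_list(problems)
```

Here the first problem contributes 6 + 6 pairs and the second contributes only the 2 pairs from its target list, because its source pairs are already present.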

For each pair $r$ in $R$, we make a list $S(r)$ of the phrases in the corpus that contain the
pair $r$. Let $a_i : a_j$ be the terms in the pair $r$. We search in the corpus for all phrases of the
following form:

"[0 to 1 words] $a_i$ [0 to 3 words] $a_j$ [0 to 1 words]" (29)

If $a_i : a_j$ is in $R$, then $a_j : a_i$ is also in $R$, so we find phrases with the members of the pairs
in both orders, $S(a_i : a_j)$ and $S(a_j : a_i)$. The search template (29) is the same as used by
Turney (2008).
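
The search template (29) can be illustrated on a small tokenized text. The sketch below is only an in-memory stand-in for the corpus search, under two assumptions that the text does not settle: tokens are whitespace-separated words, and each match keeps the widest available context (one word on each side where one exists). The function name is hypothetical.

```python
def find_phrases(tokens, term_a, term_b, max_gap=3, max_context=1):
    """Find phrases "[0-1 words] term_a [0-3 words] term_b [0-1 words]".

    tokens is a list of words; term_a and term_b may be multi-word terms.
    Assumption: each match is returned with the widest available context.
    """
    a, b = term_a.split(), term_b.split()
    n = len(tokens)
    phrases = []
    for start_a in range(n - len(a) + 1):
        if tokens[start_a:start_a + len(a)] != a:
            continue
        end_a = start_a + len(a)
        for gap in range(max_gap + 1):       # 0 to 3 intervening words
            start_b = end_a + gap
            end_b = start_b + len(b)
            if tokens[start_b:end_b] != b:
                continue
            left = max(0, start_a - max_context)
            right = min(n, end_b + max_context)
            phrases.append(" ".join(tokens[left:right]))
    return phrases

tokens = "a sun centered solar system illustrates the idea".split()
forward = find_phrases(tokens, "sun", "solar system")
backward = find_phrases(tokens, "solar system", "sun")
```

In this toy text the pair sun : solar system yields one phrase, while the reversed order yields none.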

In the following experiments, we search in a corpus of $5 \times 10^{10}$ English words (about 280
GB of plain text), consisting of web pages gathered by a web crawler.⁴ To retrieve phrases
from the corpus, we use Wumpus (Büttcher & Clarke, 2005), an efficient search engine for
passage retrieval from large corpora.⁵

3. We have $m(m-1)$ here, not $m(m-1)/2$, because we need the pairs in both orders. We only want
to calculate $\mathrm{sim}_r$ for one order of the pairs, because $i$ is always less than $j$ in (26); however, to ensure
that $\mathrm{sim}_r$ is symmetrical, as in (16), we need to make the matrix $X$ symmetrical, by having rows in the
matrix for both orders of every pair.

4. The corpus was collected by Charles Clarke at the University of Waterloo. We can provide copies of the
corpus on request.

With the 1,694 pairs in R, we find a total of 1,996,464 phrases in the corpus, an average 
of about 1,180 phrases per pair. For the pair r = sun: solar system, a typical phrase s in 
S(r) would be "a sun centered solar system illustrates" . 

Next we make a list $C$ of patterns, based on the phrases we have found. For each pair
$r$ in $R$, where $r = a_i : a_j$, if we found a phrase $s$ in $S(r)$, then we replace $a_i$ in $s$ with X
and we replace $a_j$ with Y. The remaining words may be either left as they are or replaced
with a wild card symbol "*". We then replace $a_i$ in $s$ with Y and $a_j$ with X, and replace
the remaining words with wild cards or leave them as they are. If there are $n$ remaining
words in $s$, after $a_i$ and $a_j$ are replaced, then we generate $2^{n+1}$ patterns from $s$, and we add
these patterns to $C$. We only add new patterns to $C$; that is, $C$ is a list of pattern types,
not pattern tokens; there are no duplicates in $C$.

For example, for the pair sun : solar system, we found the phrase "a sun centered solar
system illustrates". When we replace $a_i : a_j$ with X : Y, we have "a X centered Y
illustrates". There are three remaining words, so we can generate eight patterns, such as
"a X * Y illustrates", "a X centered Y *", "* X * Y illustrates", and so on. Each of these
patterns is added to $C$. Then we replace $a_i : a_j$ with Y : X, yielding "a Y centered X
illustrates". This gives us another eight patterns, such as "a Y centered X *". Thus the
phrase "a sun centered solar system illustrates" generates a total of sixteen patterns, which
we add to $C$.
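
The pattern generation step can be sketched as follows. The function name and its argument layout (the remaining words split into those before, between, and after the two terms) are assumptions made for illustration.

```python
from itertools import product

def generate_patterns(left, middle, right):
    """Generate the 2^(n+1) patterns for one phrase.

    left, middle, right are the remaining words before, between, and after
    the two terms. Every subset of the n remaining words is replaced by the
    wild card "*", once with the terms as X..Y and once as Y..X.
    """
    remaining = left + middle + right
    patterns = set()
    for first, second in (("X", "Y"), ("Y", "X")):
        for mask in product((False, True), repeat=len(remaining)):
            words = ["*" if wild else w for w, wild in zip(remaining, mask)]
            l = words[:len(left)]
            m = words[len(left):len(left) + len(middle)]
            r = words[len(left) + len(middle):]
            patterns.add(" ".join(l + [first] + m + [second] + r))
    return patterns

# The phrase "a sun centered solar system illustrates" from the text.
patterns = generate_patterns(["a"], ["centered"], ["illustrates"])
```

For the example phrase there are three remaining words, so the set contains sixteen patterns, matching the count given in the text.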

Now we revise $R$, to make a list of pairs that will correspond to rows in the frequency
matrix $F$. We remove any pairs from $R$ for which no phrases were found in the corpus,
when the terms were in either order. Let $a_i : a_j$ be the terms in the pair $r$. We remove
$r$ from $R$ if both $S(a_i : a_j)$ and $S(a_j : a_i)$ are empty. We remove such rows because they
would correspond to zero vectors in the matrix $F$. This reduces $R$ from 1,694 pairs to 1,662
pairs. Let $n_r$ be the number of pairs in $R$.

Next we revise $C$, to make a list of patterns that will correspond to columns in the
frequency matrix $F$. In the following experiments, at this stage, $C$ contains millions of
patterns, too many for efficient processing with a standard desktop computer. We need to
reduce $C$ to a more manageable size. We select the patterns that are shared by the most
pairs. Let $c$ be a pattern in $C$. Let $r$ be a pair in $R$. If there is a phrase $s$ in $S(r)$, such
that there is a pattern generated from $s$ that is identical to $c$, then we say that $r$ is one of
the pairs that generated $c$. We sort the patterns in $C$ in descending order of the number
of pairs in $R$ that generated each pattern, and we select the top $t n_r$ patterns from this
sorted list. Following Turney (2008), we set the parameter $t$ to 20; hence $C$ is reduced to
the top 33,240 patterns ($t n_r = 20 \times 1{,}662 = 33{,}240$). Let $n_c$ be the number of patterns in
$C$ ($n_c = t n_r$).
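
A minimal sketch of this column selection is given below. The function name, the input layout, and the alphabetical tie-breaking rule are assumptions; the text does not say how patterns with equal counts are ordered.

```python
from collections import Counter

def select_top_patterns(patterns_by_pair, t):
    """Keep the t * n_r patterns that are generated by the most pairs.

    patterns_by_pair maps each pair in R to the patterns generated from its
    phrases. Ties are broken alphabetically (an assumption for this sketch).
    """
    pair_counts = Counter()
    for patterns in patterns_by_pair.values():
        pair_counts.update(set(patterns))  # each pair counts once per pattern
    n_r = len(patterns_by_pair)
    ranked = sorted(pair_counts, key=lambda p: (-pair_counts[p], p))
    return ranked[: t * n_r]

# Toy data: three pairs (n_r = 3) and four distinct patterns.
patterns_by_pair = {
    ("sun", "planet"): ["* X attracts Y *", "X and Y", "X * Y", "Y of X"],
    ("nucleus", "electron"): ["* X attracts Y *", "X * Y"],
    ("planet", "sun"): ["X * Y"],
}
C = select_top_patterns(patterns_by_pair, t=1)
```

With $t = 1$ and three pairs, the three most widely shared patterns are kept and the fourth is dropped.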

Now that the rows $R$ and columns $C$ are defined, we can build the frequency matrix
$F$. Let $r_i$ be the $i$-th pair of terms in $R$ (e.g., let $r_i$ be sun : solar system) and let $c_j$ be
the $j$-th pattern in $C$ (e.g., let $c_j$ be "* X centered Y *"). We instantiate X and Y in the
pattern $c_j$ with the terms in $r_i$ ("* sun centered solar system *"). The element $f_{ij}$ in $F$ is
the frequency of this instantiated pattern in the corpus.

Note that we do not need to search again in the corpus for the instantiated pattern for
$f_{ij}$, in order to find its frequency. In the process of creating each pattern, we can keep track
of how many phrases generated the pattern, for each pair. We can get the frequency for
$f_{ij}$ by checking our record of the patterns that were generated by $r_i$.

5. Wumpus was developed by Stefan Büttcher and it is available at http://www.wumpus-search.org/.

The next step is to transform the matrix F of raw frequencies into a form X that 
enhances the similarity measurement. Turney (2006) used the log entropy transformation, 
as suggested by Landauer and Dumais (1997). This is a kind of tf-idf (term frequency 
times inverse document frequency) transformation, which gives more weight to elements in 
the matrix that are statistically surprising. However, Bullinaria and Levy (2007) recently 
achieved good results with a new transformation, called PPMIC (Positive Pointwise Mutual 
Information with Cosine); therefore LRME uses PPMIC. The raw frequencies in F are used 
to calculate probabilities, from which we can calculate the pointwise mutual information 
(PMI) of each element in the matrix. Any element with a negative PMI is then set to zero. 



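$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{n_r} \sum_{j=1}^{n_c} f_{ij}} \tag{30}$$

$$p_{i*} = \frac{\sum_{j=1}^{n_c} f_{ij}}{\sum_{i=1}^{n_r} \sum_{j=1}^{n_c} f_{ij}} \tag{31}$$

$$p_{*j} = \frac{\sum_{i=1}^{n_r} f_{ij}}{\sum_{i=1}^{n_r} \sum_{j=1}^{n_c} f_{ij}} \tag{32}$$

$$\mathrm{pmi}_{ij} = \log \left( \frac{p_{ij}}{p_{i*} \, p_{*j}} \right) \tag{33}$$

$$x_{ij} = \begin{cases} \mathrm{pmi}_{ij} & \text{if } \mathrm{pmi}_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{34}$$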



Let $r_i$ be the $i$-th pair of terms in $R$ (e.g., let $r_i$ be sun : solar system) and let $c_j$ be the
$j$-th pattern in $C$ (e.g., let $c_j$ be "* X centered Y *"). In (33), $p_{ij}$ is the estimated probability
of the pattern $c_j$ instantiated with the pair $r_i$ ("* sun centered solar system *"), $p_{i*}$
is the estimated probability of $r_i$, and $p_{*j}$ is the estimated probability of $c_j$. If $r_i$ and $c_j$ are
statistically independent, then $p_{i*} p_{*j} = p_{ij}$ (by the definition of independence), and thus
$\mathrm{pmi}_{ij}$ is zero (since $\log(1) = 0$). If there is an interesting semantic relation between the
terms in $r_i$, and the pattern $c_j$ captures an aspect of that semantic relation, then we should
expect $p_{ij}$ to be larger than it would be if $r_i$ and $c_j$ were independent; hence we should find
that $p_{ij} > p_{i*} p_{*j}$, and thus $\mathrm{pmi}_{ij}$ is positive. (See Hypothesis 2 in Section 2.) On the other
hand, terms from completely different domains may avoid each other, in which case we
should find that $\mathrm{pmi}_{ij}$ is negative. PPMIC is designed to give a high value to $x_{ij}$ when the
pattern $c_j$ captures an aspect of the semantic relation between the terms in $r_i$; otherwise,
$x_{ij}$ should have a value of zero, indicating that the pattern $c_j$ tells us nothing about the
semantic relation between the terms in $r_i$.
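
The transformation from raw frequencies to positive PMI values can be sketched with NumPy. This is an illustrative dense-matrix version, not the original Perl/PDL code; the function name is hypothetical, and treating zero-frequency cells as zero (rather than as an undefined logarithm) is an assumption consistent with setting negative PMI to zero.

```python
import numpy as np

def ppmi(F):
    """Positive pointwise mutual information of a frequency matrix F."""
    F = np.asarray(F, dtype=float)
    p_ij = F / F.sum()                         # joint probability estimates
    p_i = p_ij.sum(axis=1, keepdims=True)      # row (pair) marginals
    p_j = p_ij.sum(axis=0, keepdims=True)      # column (pattern) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts carry no information
    return np.maximum(pmi, 0.0)                # negative PMI is set to zero

# Two toy cases: perfectly associated rows and columns, and independence.
X_assoc = ppmi([[10, 0], [0, 10]])
X_indep = ppmi([[1, 1], [1, 1]])
```

In the first toy matrix each diagonal element has $p_{ij} = 0.5$ and $p_{i*} p_{*j} = 0.25$, so its PMI is $\log 2$; in the second, rows and columns are independent and every element is zero.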

In our experiments, $F$ has a density of 4.6% (the percentage of nonzero elements) and
$X$ has a density of 3.8%. The lower density of $X$ is due to elements with a negative PMI,
which are transformed to zero by PPMIC.

Now we smooth $X$ by applying a truncated singular value decomposition (SVD) (Golub
& Van Loan, 1996). We use SVDLIBC to calculate the SVD of $X$.⁶ SVDLIBC is designed
for sparse (low density) matrices. SVD decomposes $X$ into the product of three matrices
$U \Sigma V^T$, where $U$ and $V$ are in column orthonormal form (i.e., the columns are orthogonal
and have unit length, $U^T U = V^T V = I$) and $\Sigma$ is a diagonal matrix of singular values
(Golub & Van Loan, 1996). If $X$ is of rank $r$, then $\Sigma$ is also of rank $r$. Let $\Sigma_k$, where
$k < r$, be the diagonal matrix formed from the top $k$ singular values, and let $U_k$ and $V_k$ be
the matrices produced by selecting the corresponding columns from $U$ and $V$. The matrix
$U_k \Sigma_k V_k^T$ is the matrix of rank $k$ that best approximates the original matrix $X$, in the sense
that it minimizes the approximation errors. That is, $\hat{X} = U_k \Sigma_k V_k^T$ minimizes $\|\hat{X} - X\|_F$
over all matrices $\hat{X}$ of rank $k$, where $\|\cdot\|_F$ denotes the Frobenius norm (Golub & Van
Loan, 1996). We may think of this matrix $U_k \Sigma_k V_k^T$ as a smoothed or compressed version
of the original matrix $X$. Following Turney (2006), we set the parameter $k$ to 300.

The relational similarity $\mathrm{sim}_r$ between two pairs in $R$ is the inner product of the two
corresponding rows in $U_k \Sigma_k V_k^T$, after the rows have been normalized to unit length. We can
simplify calculations by dropping $V_k$ (Deerwester, Dumais, Landauer, Furnas, & Harshman,
1990). We take the matrix $U_k \Sigma_k$ and normalize each row to unit length. Let $W$ be the
resulting matrix. Now let $Z$ be $W W^T$, a square matrix of size $n_r \times n_r$. This matrix contains
the cosines of all combinations of two pairs in $R$.
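
The smoothing and cosine steps can be sketched with NumPy as follows. This sketch uses a dense SVD (`numpy.linalg.svd`) in place of the sparse SVDLIBC routine used in the experiments, and the function name is hypothetical; it is meant only to make the construction of $W$ and $Z$ concrete.

```python
import numpy as np

def relational_cosines(X, k):
    """Return Z = W W^T, the cosines between rows of U_k Sigma_k."""
    U, s, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    W = U[:, :k] * s[:k]                        # U_k Sigma_k (V_k is dropped)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0                   # keep all-zero rows as zeros
    W = W / norms                               # unit-length rows
    return W @ W.T

# Toy matrix: rows 0 and 1 are parallel, row 2 is orthogonal to both.
X = [[1.0, 0.0, 0.0],
     [2.0, 0.0, 0.0],
     [0.0, 3.0, 0.0]]
Z = relational_cosines(X, k=2)
```

Because the rows are normalized before the product, each element of `Z` is a cosine, and the diagonal elements are 1 for every nonzero row.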

For a mapping problem $(A, B)$ in $I$, let $a : a'$ be a pair of terms from $A$ and let $b : b'$ be
a pair of terms from $B$. Suppose that $r_i = a : a'$ and $r_j = b : b'$, where $r_i$ and $r_j$ are the
$i$-th and $j$-th pairs in $R$. Then $\mathrm{sim}_r(a : a', b : b') = z_{ij}$, where $z_{ij}$ is the element in the $i$-th
row and $j$-th column of $Z$. If either $a : a'$ or $b : b'$ is not in $R$, because $S(a : a')$, $S(a' : a)$,
$S(b : b')$, or $S(b' : b)$ is empty, then we set the similarity to zero. Finally, for each mapping
problem in $I$, we output the map $M_r$ that maximizes the sum of the relational similarities.

$$M_r = \operatorname*{argmax}_{M \in P(A,B)} \sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}_r(a_i : a_j, M(a_i) : M(a_j)) \tag{35}$$
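
The final search over mappings can be sketched as an exhaustive loop over permutations. The function names and the toy similarity table are hypothetical; in LRME the similarity values would come from the matrix of cosines described above, with zero used for pairs that have no row.

```python
import random
from itertools import permutations

def relational_score(source, mapped, sim_r):
    """Sum sim_r over all pairs of positions i < j."""
    m = len(source)
    return sum(
        sim_r((source[i], source[j]), (mapped[i], mapped[j]))
        for i in range(m) for j in range(i + 1, m)
    )

def best_relational_mapping(source, target, sim_r, rng=random):
    """Exhaustively evaluate all bijections; break ties randomly."""
    best_score, best = float("-inf"), []
    for perm in permutations(target):
        score = relational_score(source, perm, sim_r)
        if score > best_score:
            best_score, best = score, [perm]
        elif score == best_score:
            best.append(perm)
    return dict(zip(source, rng.choice(best))), best_score

# Made-up similarity table; pairs without an entry get similarity zero.
table = {
    (("sun", "planet"), ("nucleus", "electron")): 0.9,
    (("sun", "planet"), ("electron", "nucleus")): 0.2,
}
sim_r = lambda p, q: table.get((p, q), 0.0)
mapping, score = best_relational_mapping(
    ["sun", "planet"], ["electron", "nucleus"], sim_r
)
```

The loop visits $m!$ candidate mappings, which is feasible for the problem sizes considered here (at most nine terms).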

The simplified form of LRA used here to calculate $\mathrm{sim}_r$ differs from LRA used by Turney
(2006) in several ways. In LRME, there is no use of synonyms to generate alternate forms of
the pairs of terms. In LRME, there is no morphological processing of the terms. LRME uses
PPMIC (Bullinaria & Levy, 2007) to process the raw frequencies, instead of log entropy.
Following Turney (2008), LRME uses a slightly different search template (29) and LRME
sets the number of columns $n_c$ to $t n_r$, instead of using a constant. In Section 7.2, we
evaluate the impact of two of these changes (PPMIC and $n_c$), but we have not tested
the other changes, which were mainly motivated by a desire for increased efficiency and
simplicity.

7.2 Experiments 

We implemented LRME in Perl, making external calls to Wumpus for searching the corpus
and to SVDLIBC for calculating SVD. We used the Perl Net::Telnet package for interprocess
communication with Wumpus, the PDL (Perl Data Language) package for matrix manipulations
(e.g., calculating cosines), and the List::Permutor package to generate permutations
(i.e., to loop through $P(A,B)$).

6. SVDLIBC is the work of Doug Rohde and it is available at http://tedlab.mit.edu/~dr/svdlibc/.

We ran the following experiments on a dual core AMD Opteron 64 computer, running 
64 bit Linux. Most of the running time is spent searching the corpus for phrases. It took 
16 hours and 27 minutes for Wumpus to fetch the 1,996,464 phrases. The remaining steps 
took 52 minutes, of which SVD took 10 minutes. The running time could be cut in half by 
using RAID to speed up disk access. 

Table 6 shows the performance of LRME in its baseline configuration. For comparison, 
the agreement of the 22 volunteers with our intended mapping has been copied from Table 5. 
The difference between the performance of LRME (91.5%) and the human participants 
(87.6%) is not statistically significant (paired t-test, 95% confidence level). 
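
The significance check can be sketched with the per-problem accuracies reported in Table 6. The sketch assumes a two-tailed paired t-test over the twenty problems (the text does not state whether the test is one-tailed or two-tailed) and compares the statistic with the standard critical value for 19 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Per-problem accuracies from Table 6 (A1-A10, then M1-M10).
lrme = [100.0, 100.0, 100.0, 100.0, 71.4, 100.0, 71.4, 100.0, 55.6, 100.0,
        71.4, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 60.0, 100.0, 100.0]
humans = [90.9, 86.9, 81.8, 79.0, 79.2, 97.4, 74.7, 88.1, 84.3, 83.6,
          93.5, 96.1, 87.9, 100.0, 77.3, 89.0, 98.7, 89.1, 96.6, 78.8]

def paired_t_statistic(xs, ys):
    """t statistic for the mean of the paired differences xs[i] - ys[i]."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

t_stat = paired_t_statistic(lrme, humans)
T_CRITICAL = 2.093  # Student's t, 19 degrees of freedom, two-tailed, alpha = 0.05
significant = abs(t_stat) > T_CRITICAL
```

Under these assumptions the statistic falls below the critical value, in agreement with the statement that the difference is not statistically significant.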



                                                            Accuracy
Mapping   Source → Target                                 LRME   Humans
A1        solar system → atom                            100.0     90.9
A2        water flow → heat transfer                     100.0     86.9
A3        waves → sounds                                 100.0     81.8
A4        combustion → respiration                       100.0     79.0
A5        sound → light                                   71.4     79.2
A6        projectile → planet                            100.0     97.4
A7        artificial selection → natural selection        71.4     74.7
A8        billiard balls → gas molecules                 100.0     88.1
A9        computer → mind                                 55.6     84.3
A10       slot machine → bacterial mutation              100.0     83.6
M1        war → argument                                  71.4     93.5
M2        buying an item → accepting a belief            100.0     96.1
M3        grounds for a building → reasons for a theory  100.0     87.9
M4        impediments to travel → difficulties           100.0    100.0
M5        money → time                                   100.0     77.3
M6        seeds → ideas                                  100.0     89.0
M7        machine → mind                                 100.0     98.7
M8        object → idea                                   60.0     89.1
M9        following → understanding                      100.0     96.6
M10       seeing → understanding                         100.0     78.8
Average                                                   91.5     87.6

Table 6: LRME in its baseline configuration, compared with human performance.



In Table 6, the column labeled Humans is the average of 22 people, whereas the LRME 
column is only one algorithm (it is not an average) . Comparing an average of several scores 
to an individual score (whether the individual is a human or an algorithm) may give a 
misleading impression. In the results for any individual person, there are typically several 
100% scores and a few scores in the 55-75% range. The average mapping problem has seven 
terms. It is not possible to have exactly one term mapped incorrectly; if there are any 
incorrect mappings, then there must be two or more incorrect mappings. This follows from 
the nature of bijections. Therefore a score of 5/7 = 71.4% is not uncommon. 
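
The claim that a bijection cannot have exactly one incorrect mapping can be checked by brute force for a small problem. The sketch below (with a hypothetical function name) counts, for every permutation of $m$ terms, how many terms are mapped to the wrong place.

```python
from collections import Counter
from itertools import permutations

def wrong_count_distribution(m):
    """Count permutations of m terms by their number of incorrect mappings."""
    identity = tuple(range(m))   # the intended mapping
    return Counter(
        sum(1 for i, j in zip(identity, perm) if i != j)
        for perm in permutations(identity)
    )

dist = wrong_count_distribution(5)
```

For $m = 5$, one permutation has no errors, none has exactly one error, and ten have exactly two errors (one for each pair of terms that can be swapped).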






Table 7 looks at the results from another perspective. The column labeled LRME wrong 
gives the number of incorrect mappings made by LRME for each of the twenty problems. 
The five columns labeled Number of people with N wrong show, for various values of N, 
how many of the 22 people made N incorrect mappings. For the average mapping problem,
15 out of 22 participants had a perfect score (N = 0); of the remaining 7 participants, 5 
made only two mistakes (N = 2). Table 7 shows more clearly than Table 6 that LRME's 
performance is not significantly different from (individual) human performance. (For yet 
another perspective, see Section 9.1). 





          LRME         Number of people with N wrong
Mapping   wrong    N = 0   N = 1   N = 2   N = 3   N ≥ 4     m
A1          0        16      0       4       2       0       7
A2          0        14      0       5       0       3       8
A3          0         9      0       9       2       2       8
A4          0         9      0       9       0       4       8
A5          2        10      0       7       2       3       7
A6          0        20      0       2       0       0       7
A7          2         8      0       6       6       2       7
A8          0        13      0       8       0       1       8
A9          4        11      0       7       2       2       9
A10         0        13      0       9       0       0       5
M1          2        17      0       5       0       0       7
M2          0        19      0       3       0       0       7
M3          0        14      0       8       0       0       6
M4          0        22      0       0       0       0       7
M5          0         9      0      11       0       2       6
M6          0        15      0       4       3       0       7
M7          0        21      0       1       0       0       7
M8          2        18      0       2       1       1       5
M9          0        19      0       3       0       0       8
M10         0        13      0       3       3       3       6
Average     1        15      0       5       1       1       7

Table 7: Another way of viewing LRME versus human performance.



In Table 8, we examine the sensitivity of LRME to the parameter settings. The first row 
shows the accuracy of the baseline configuration, as in Table 6. The next eight rows show 
the impact of varying k, the dimensionality of the truncated singular value decomposition, 
from 50 to 400. The eight rows after that show the effect of varying t, the column factor, 
from 5 to 40. The number of columns in the matrix ($n_c$) is given by the number of rows ($n_r$
= 1,662) multiplied by $t$. The second last row shows the effect of eliminating the singular
value decomposition from LRME. This is equivalent to setting k to 1,662, the number 
of rows in the matrix. The final row gives the result when PPMIC (Bullinaria & Levy, 
2007) is replaced with log entropy (Turney, 2006). LRME is not sensitive to any of these 
manipulations: None of the variations in Table 8 perform significantly differently from the 
baseline configuration (paired t-test, 95% confidence level). (This does not necessarily mean 
that the manipulations have no effect; rather, it suggests that a larger sample of problems 
would be needed to show a significant effect.)

Experiment                  k      t      n_c    Accuracy
baseline configuration     300    20    33,240     91.5
varying k                   50    20    33,240     [illegible]
                           100    20    33,240     [illegible]
                           150    20    33,240     [illegible]
                           200    20    33,240     [illegible]
                           250    20    33,240     90.6
                           300    20    33,240     91.5
                           350    20    33,240     90.6
                           400    20    33,240     90.6
varying t                  300     5     8,310     [illegible]
                           300    10    16,620     [illegible]
                           300    15    24,930     94.0
                           300    20    33,240     91.5
                           300    25    41,550     90.1
                           300    30    49,860     90.6
                           300    35    58,170     89.5
                           300    40    66,480     91.7
dropping SVD              1662    20    33,240     89.7
log entropy                300    20    33,240     83.9

Table 8: Exploring the sensitivity of LRME to various parameter settings and modifications.



8. Attribute Mapping Approaches 

In this section, we explore a variety of attribute mapping approaches for the twenty mapping
problems. All of these approaches seek the mapping $M_a$ that maximizes the sum of the
attributional similarities.

$$M_a = \operatorname*{argmax}_{M \in P(A,B)} \sum_{i=1}^{m} \mathrm{sim}_a(a_i, M(a_i)) \tag{36}$$

We search for $M_a$ by exhaustively evaluating all of the possibilities. Ties are broken
randomly. We use a variety of different algorithms to calculate $\mathrm{sim}_a$.
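
The attributional search in (36) can be sketched in the same exhaustive style. The function name and the toy similarity values are hypothetical; any of the similarity measures described in Section 8.1 could be supplied as `sim_a`.

```python
import random
from itertools import permutations

def best_attributional_mapping(source, target, sim_a, rng=random):
    """Maximize the sum of term-to-term similarities over all bijections."""
    best_score, best = float("-inf"), []
    for perm in permutations(target):
        score = sum(sim_a(a, b) for a, b in zip(source, perm))
        if score > best_score:
            best_score, best = score, [perm]
        elif score == best_score:
            best.append(perm)
    return dict(zip(source, rng.choice(best))), best_score  # random tie-break

# Made-up attributional similarities between individual terms.
table = {
    ("sun", "nucleus"): 0.6, ("sun", "electron"): 0.1,
    ("planet", "nucleus"): 0.3, ("planet", "electron"): 0.5,
}
sim_a = lambda a, b: table.get((a, b), 0.0)
mapping, score = best_attributional_mapping(
    ["sun", "planet"], ["electron", "nucleus"], sim_a
)
```

Unlike the relational search, each term contributes to the score independently of how the other terms are mapped.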

8.1 Algorithms 

In the following experiments, we test five lexicon-based attributional similarity measures
that use WordNet:⁷ HSO (Hirst & St-Onge, 1998), JC (Jiang & Conrath, 1997), LC (Leacock
& Chodrow, 1998), LIN (Lin, 1998), and RES (Resnik, 1995). All five are implemented
in the Perl package WordNet::Similarity,⁸ which builds on the WordNet::QueryData⁹ package.
The core idea behind them is to treat WordNet as a graph and measure the semantic
distance between two terms by the length of the shortest path between them in the graph.
Similarity increases as distance decreases.

7. WordNet was developed by a team at Princeton and it is available at http://wordnet.princeton.edu/.

8. Ted Pedersen's WordNet::Similarity package is at http://www.d.umn.edu/~tpederse/similarity.html.

9. Jason Rennie's WordNet::QueryData package is at http://people.csail.mit.edu/jrennie/WordNet/.

HSO works with nouns, verbs, adjectives, and adverbs, but JC, LC, LIN, and RES only
work with nouns and verbs. We used WordNet::Similarity to try all possible parts of speech
and all possible senses for each input word. Many adjectives, such as true and valuable,
also have noun and verb senses in WordNet, so JC, LC, LIN, and RES are still able to
calculate similarity for them. When the raw form of a word is not found in WordNet,
WordNet::Similarity searches for morphological variations of the word. When there are
multiple similarity scores, for multiple parts of speech and multiple senses, we select the
highest similarity score. When there is no similarity score, because a word is not in WordNet,
or because JC, LC, LIN, or RES could not find an alternative noun or verb form for an
adjective or adverb, we set the score to zero.

We also evaluate two corpus-based attributional similarity measures: PMI-IR (Turney, 
2001) and LSA (Landauer & Dumais, 1997). The core idea behind them is that "a word 
is characterized by the company it keeps" (Firth, 1957). The similarity of two terms is 
measured by the similarity of their statistical distributions in a corpus. We used the corpus 
of Section 7 along with Wumpus to implement PMI-IR (Pointwise Mutual Information 
with Information Retrieval). For LSA (Latent Semantic Analysis), we used the online 
demonstration.¹⁰ We selected the Matrix Comparison option with the General Reading up
to 1st year college (300 factors) topic space and the term-to-term comparison type. PMI-IR 
and LSA work with all parts of speech. 

Our eighth similarity measure is based on the observation that our intended mappings
map terms that have the same part of speech (see Appendix A). Let $\mathrm{POS}(a)$ be the
part-of-speech tag assigned to the term $a$. We use part-of-speech tags to define a measure of
attributional similarity, $\mathrm{sim}_{\mathrm{POS}}(a, b)$, as follows.

$$\mathrm{sim}_{\mathrm{POS}}(a, b) = \begin{cases} 100 & \text{if } a = b \\ 10 & \text{if } \mathrm{POS}(a) = \mathrm{POS}(b) \\ 0 & \text{otherwise} \end{cases} \tag{37}$$

We hand-labeled the terms in the mapping problems with part-of-speech tags (Santorini, 
1990). Automatic taggers assume that the words that are to be tagged are embedded in 
a sentence, but the terms in our mapping problems are not in sentences, so their tags are 
ambiguous. We used our knowledge of the intended mappings to manually disambiguate 
the part-of-speech tags for the terms, thus guaranteeing that corresponding terms in the 
intended mapping always have the same tags. 
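
The measure defined in (37) can be written directly as a small function. The function name, the dictionary of tags, and the example terms are hypothetical; the tags stand in for the hand-assigned labels described above.

```python
def sim_pos(a, b, pos_tags):
    """Part-of-speech similarity: 100 if identical, 10 if same tag, else 0."""
    if a == b:
        return 100
    if pos_tags[a] == pos_tags[b]:
        return 10
    return 0

# Hypothetical hand-assigned tags for three terms.
pos_tags = {"sun": "NN", "nucleus": "NN", "attracts": "VBZ"}
same_term = sim_pos("sun", "sun", pos_tags)
same_tag = sim_pos("sun", "nucleus", pos_tags)
different_tag = sim_pos("sun", "attracts", pos_tags)
```

The identity test is applied first, so a term compared with itself scores 100 regardless of its tag.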

From the first seven attributional similarity measures above, we created seven more
similarity measures by combining each of them with $\mathrm{sim}_{\mathrm{POS}}(a, b)$. For example, let
$\mathrm{sim}_{\mathrm{HSO}}(a, b)$ be the Hirst and St-Onge (1998) similarity measure. We combine
$\mathrm{sim}_{\mathrm{POS}}(a, b)$ and $\mathrm{sim}_{\mathrm{HSO}}(a, b)$ by simply adding them.

$$\mathrm{sim}_{\mathrm{HSO+POS}}(a, b) = \mathrm{sim}_{\mathrm{HSO}}(a, b) + \mathrm{sim}_{\mathrm{POS}}(a, b) \tag{38}$$

The values returned by $\mathrm{sim}_{\mathrm{POS}}(a, b)$ range from 0 to 100, whereas the values returned by
$\mathrm{sim}_{\mathrm{HSO}}(a, b)$ are much smaller. We chose large values in (37) so that getting POS tags to
match up has more weight than any of the other similarity measures. The manual POS tags
and the high weight of $\mathrm{sim}_{\mathrm{POS}}(a, b)$ give an unfair advantage to the attributional mapping
approach, but the relational mapping approach can afford to be generous.

10. The online demonstration of LSA is the work of a team at the University of Colorado at Boulder. It is
available at http://lsa.colorado.edu/.

8.2 Experiments 

Table 9 presents the accuracy of the various measures of attributional similarity. The 
best result without POS labels is 55.9% (HSO). The best result with POS labels is 76.8% 
(LIN+POS). The 91.5% accuracy of LRME (see Table 6) is significantly higher than the 
76.8% accuracy of LIN+POS (and thus, of course, significantly higher than everything else 
in Table 9; paired t-test, 95% confidence level). The average human performance of 87.6% 
(see Table 5) is also significantly higher than the 76.8% accuracy of LIN+POS (paired t-test, 
95% confidence level). In summary, humans and LRME perform significantly better than 
all of the variations of attributional mapping approaches that were tested. 



Algorithm            Reference                      Accuracy
HSO                  Hirst and St-Onge (1998)         55.9
JC                   Jiang and Conrath (1997)         54.7
LC                   Leacock and Chodrow (1998)       48.5
LIN                  Lin (1998)                       48.2
RES                  Resnik (1995)                    43.8
PMI-IR               Turney (2001)                    54.4
LSA                  Landauer and Dumais (1997)       39.6
POS (hand-labeled)   Santorini (1990)                 44.8
HSO+POS              Hirst and St-Onge (1998)         71.1
JC+POS               Jiang and Conrath (1997)         73.6
LC+POS               Leacock and Chodrow (1998)       69.5
LIN+POS              Lin (1998)                       76.8
RES+POS              Resnik (1995)                    71.6
PMI-IR+POS           Turney (2001)                    72.8
LSA+POS              Landauer and Dumais (1997)       65.8

Table 9: The accuracy of attribute mapping approaches for a wide variety of measures of
attributional similarity.



9. Discussion 

In this section, we examine three questions that are suggested by the preceding results. 
Is there a difference between the science analogy problems and the common metaphor 
problems? Is there an advantage to combining the relational and attributional mapping ap- 
proaches? What is the advantage of the relational mapping approach over the attributional 
mapping approach? 

9.1 Science Analogies versus Common Metaphors 

Table 5 suggests that science analogies may be more difficult than common metaphors. This 
is supported by Table 10, which shows how the agreement of the 22 participants with our 
intended mapping (see Section 6) varies between the science problems and the metaphor
problems. The science problems have a lower average performance and greater variation in
performance. The difference between the science problems and the metaphor problems is 
statistically significant (paired t-test, 95% confidence level). 



                               Average Accuracy
Participant          All 20    10 Science    10 Metaphor
1                     72.6        59.9          85.4
2                     88.2        85.9          90.5
3                     90.0        86.3          93.8*
4                     71.8        56.4          87.1
5                     95.7*       94.2*         97.1*
6                     83.4        83.9          82.9
7                     79.6        73.6          85.7
8                     91.9*       95.0*         88.8
9                     89.7        90.0*         89.3
10                    80.7        81.4          80.0
11                    94.5*       95.7*         93.3*
12                    90.6        87.4          93.8*
13                    93.2*       89.6          96.7*
14                    97.1*       94.3*        100.0*
15                    86.6        88.5          84.8
16                    80.5        80.2          80.7
17                    93.3*       89.9*         96.7*
18                    86.5        78.9          94.2*
19                    92.9*       96.0*         89.8
20                    90.4        84.1          96.7*
21                    82.7        74.9          90.5
22                    96.2*       94.9*         97.5*
Average               87.6        84.6          90.7
Standard deviation     7.2        10.8           5.8

Table 10: A comparison of the difficulty of the science problems versus the metaphor problems
for the 22 participants. The numbers in bold font (marked here with *) are the scores
that are above the scores of LRME.

The average science problem has more terms (7.4) than the average metaphor problem 
(6.6), which might contribute to the difficulty of the science problems. However, Table 11 
shows that there is no clear relation between the number of terms in a problem (m in 
Table 5) and the level of agreement. We believe that people find the metaphor problems 
easier than the science problems because these common metaphors are entrenched in our 
language, whereas the science analogies are more peripheral. 

Table 12 shows that the 16 algorithms studied here perform slightly worse on the science 
problems than on the metaphor problems, but the difference is not statistically significant 
(paired t-test, 95% confidence level). We hypothesize that the attributional mapping ap- 
proaches are not performing well enough to be sensitive to subtle differences between science 
analogies and common metaphors. 

Num terms    Agreement
5              86.4
6              81.3
7              91.1
8              86.5
9              84.3

Table 11: The average agreement among the 22 participants as a function of the number of
terms in the problems.

                               Average Accuracy
Algorithm            All 20    10 Science    10 Metaphor
LRME                  91.5        89.8          93.1
HSO                   55.9        57.4          54.3
JC                    54.7        57.4          52.1
LC                    48.5        49.6          47.5
LIN                   48.2        46.7          49.7
RES                   43.8        39.0          48.6
PMI-IR                54.4        49.5          59.2
LSA                   39.6        37.3          41.9
POS                   44.8        42.1          47.4
HSO+POS               71.1        66.9          75.2
JC+POS                73.6        78.1          69.2
LC+POS                69.5        70.8          68.2
LIN+POS               76.8        68.8          84.8
RES+POS               71.6        70.3          72.9
PMI-IR+POS            72.8        65.7          79.9
LSA+POS               65.8        69.1          62.4
Average               61.4        59.9          62.9
Standard deviation    14.7        15.0          15.3

Table 12: A comparison of the difficulty of the science problems versus the metaphor problems
for the 16 algorithms.

Incidentally, these tables give us another view of the performance of LRME in comparison
to human performance. The first row in Table 12 shows the performance of LRME on
the science and metaphor problems. In Table 10, we have marked in bold font the cases
where human scores are greater than LRME's scores. For all 20 problems, there are 8
such cases; for the 10 science problems, there are 8 such cases; for the 10 metaphor problems,
there are 10 such cases. This is further evidence that LRME's performance is not
significantly different from human performance. LRME is near the middle of the range of
performance of the 22 human participants.

9.2 Hybrid Relational-Attributional Approaches

Recall the definitions of $\mathrm{score}_r(M)$ and $\mathrm{score}_a(M)$ given in Section 3.

$$\mathrm{score}_r(M) = \sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}_r(a_i\!:\!a_j,\; M(a_i)\!:\!M(a_j)) \tag{39}$$

$$\mathrm{score}_a(M) = \sum_{i=1}^{m} \mathrm{sim}_a(a_i, M(a_i)) \tag{40}$$

We can combine the scores by simply adding them or multiplying them, but $\mathrm{score}_r(M)$ and
$\mathrm{score}_a(M)$ may be quite different in the scales and distributions of their values; therefore
we first normalize them to probabilities.

$$\mathrm{prob}_r(M) = \frac{\mathrm{score}_r(M)}{\sum_{M_i \in P(A,B)} \mathrm{score}_r(M_i)} \tag{41}$$

$$\mathrm{prob}_a(M) = \frac{\mathrm{score}_a(M)}{\sum_{M_i \in P(A,B)} \mathrm{score}_a(M_i)} \tag{42}$$

For these probability estimates, we assume that $\mathrm{score}_r(M) \ge 0$ and $\mathrm{score}_a(M) \ge 0$. If
necessary, a constant value may be added to the scores, to ensure that they are not negative.
Now we can combine the scores by adding or multiplying the probabilities.



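$$M_{r+a} = \operatorname*{argmax}_{M \in P(A,B)} \left( \mathrm{prob}_r(M) + \mathrm{prob}_a(M) \right) \tag{43}$$

$$M_{r \times a} = \operatorname*{argmax}_{M \in P(A,B)} \left( \mathrm{prob}_r(M) \times \mathrm{prob}_a(M) \right) \tag{44}$$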
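
The normalization and combination in Equations 39 to 44 can be sketched in a few lines. This is an illustrative sketch and not the implementation used in the experiments: the function names are hypothetical, the similarity tables are made-up toy values, the constant added to negative scores is simply the amount needed to bring the minimum to zero, and the search is exhaustive over all bijections.

```python
from itertools import permutations

def all_mappings(A, B):
    """Enumerate P(A, B): every bijection from list A to list B, as a dict."""
    return [dict(zip(A, perm)) for perm in permutations(B)]

def score_r(M, A, sim_r):
    """Equation 39: relational similarity summed over all pairs i < j."""
    return sum(sim_r((A[i], A[j]), (M[A[i]], M[A[j]]))
               for i in range(len(A)) for j in range(i + 1, len(A)))

def score_a(M, A, sim_a):
    """Equation 40: attributional similarity summed over all terms."""
    return sum(sim_a(a, M[a]) for a in A)

def normalize(scores):
    """Equations 41-42: add a constant if any score is negative, then divide by the sum."""
    shift = max(0.0, -min(scores))
    shifted = [s + shift for s in scores]
    total = sum(shifted)
    if total == 0:  # every score is zero: fall back to a uniform distribution
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in shifted]

def hybrid_mapping(A, B, sim_r, sim_a, combine):
    """Equations 43-44: argmax over P(A, B) of the combined probabilities."""
    mappings = all_mappings(A, B)
    prob_r = normalize([score_r(M, A, sim_r) for M in mappings])
    prob_a = normalize([score_a(M, A, sim_a) for M in mappings])
    combined = [combine(pr, pa) for pr, pa in zip(prob_r, prob_a)]
    best_index = max(range(len(mappings)), key=lambda k: combined[k])
    return mappings[best_index]

# Toy similarity tables (made-up values, for illustration only).
TOY_RELATION = {
    ("sun", "planet"): "attracts", ("planet", "sun"): "orbits",
    ("nucleus", "electron"): "attracts", ("electron", "nucleus"): "orbits",
}
TOY_ATTRIBUTE_SIM = {
    ("sun", "nucleus"): 0.30, ("planet", "electron"): 0.20,
    ("sun", "electron"): 0.25, ("planet", "nucleus"): 0.35,
}

def toy_sim_r(pair1, pair2):
    return 1.0 if TOY_RELATION[pair1] == TOY_RELATION[pair2] else 0.0

def toy_sim_a(a, b):
    return TOY_ATTRIBUTE_SIM[(a, b)]

A, B = ["sun", "planet"], ["nucleus", "electron"]
M_add = hybrid_mapping(A, B, toy_sim_r, toy_sim_a, lambda pr, pa: pr + pa)
M_mul = hybrid_mapping(A, B, toy_sim_r, toy_sim_a, lambda pr, pa: pr * pa)
print(M_add, M_mul)
```

In this toy case the attributional scores slightly favour the wrong mapping, but the relational probability dominates under both addition and multiplication, so both hybrids map sun to nucleus and planet to electron.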

Table 13 shows the accuracy when LRME is combined with LIN+POS (the best attri- 
butional mapping algorithm in Table 9, with an accuracy of 76.8%) or with HSO (the best 
attributional mapping algorithm that does not use the manual POS tags, with an accuracy 
of 55.9%). We try both adding and multiplying probabilities. On its own, LRME has an 
accuracy of 91.5%. Combining LRME with LIN+POS increases the accuracy to 94.0%, but 
this improvement is not statistically significant (paired t-test, 95% confidence level). Com- 
bining LRME with HSO results in a decrease in accuracy. The decrease is not significant 
when the probabilities are multiplied (85.4%), but it is significant when the probabilities 
are added (78.5%). 

In summary, the experiments show no significant advantage to combining LRME with 
attributional mapping. However, it is possible that a larger sample of problems would 
show a significant advantage. Also, the combination methods we explored (addition and 
multiplication of probabilities) are elementary. A more sophisticated approach, such as a 
weighted combination, may perform better. 

9.3 Coherent Relations 

We hypothesize that LRME benefits from a kind of coherence among the relations. On the
other hand, attributional mapping approaches do not involve this kind of coherence.

        Components
Relational    Attributional    Combination               Accuracy
LRME          LIN+POS          add probabilities             94.0
LRME          LIN+POS          multiply probabilities        94.0
LRME          HSO              add probabilities             78.5
LRME          HSO              multiply probabilities        85.4



Table 13: The performance of four different hybrids of relational and attributional mapping 
approaches. 



Suppose we swap two of the terms in a mapping. Let $M$ be the original mapping and
let $M'$ be the new mapping, where $M'(a_1) = M(a_2)$, $M'(a_2) = M(a_1)$, and $M'(a_i) = M(a_i)$
for $i > 2$. With attributional similarity, the impact of this swap on the score of the mapping
is limited. Part of the score is not affected.

$$\mathrm{score}_a(M) = \mathrm{sim}_a(a_1, M(a_1)) + \mathrm{sim}_a(a_2, M(a_2)) + \sum_{i=3}^{m} \mathrm{sim}_a(a_i, M(a_i)) \tag{45}$$

$$\mathrm{score}_a(M') = \mathrm{sim}_a(a_1, M(a_2)) + \mathrm{sim}_a(a_2, M(a_1)) + \sum_{i=3}^{m} \mathrm{sim}_a(a_i, M(a_i)) \tag{46}$$

On the other hand, with relational similarity, the impact of a swap is not limited in this 
way. A change to any part of the mapping affects the whole score. There is a kind of global 
coherence to relational similarity that is lacking in attributional similarity. 

Testing the hypothesis that LRME benefits from coherence is somewhat complicated, 
because we need to design the experiment so that the coherence effect is isolated from any 
other effects. To do this, we move some of the terms outside of the accuracy calculation. 

Let $M^* : A \to B$ be one of our twenty mapping problems, where $M^*$ is our intended
mapping and $m = |A| = |B|$. Let $A'$ be a randomly selected subset of $A$ of size $m'$. Let $B'$
be $M^*(A')$, the subset of $B$ to which $M^*$ maps $A'$.

$$A' \subset A \tag{47}$$
$$B' \subset B \tag{48}$$
$$B' = M^*(A') \tag{49}$$
$$m' = |A'| = |B'| \tag{50}$$
$$m' < m \tag{51}$$



There are two ways that we might use LRME to generate a mapping $M' : A' \to B'$ for this
new reduced mapping problem, internal coherence and total coherence.

1. Internal coherence: We can select $M'$ based on $(A', B')$ alone.

$$A' = \{a_1, \ldots, a_{m'}\} \tag{52}$$
$$B' = \{b_1, \ldots, b_{m'}\} \tag{53}$$
$$M' = \operatorname*{argmax}_{M \in P(A',B')} \sum_{i=1}^{m'} \sum_{j=i+1}^{m'} \mathrm{sim}_r(a_i\!:\!a_j,\; M(a_i)\!:\!M(a_j)) \tag{54}$$

In this case, $M'$ is chosen based only on the relations that are internal to $(A', B')$.

2. Total coherence: We can select $M'$ based on $(A, B)$ and the knowledge that $M'$
must satisfy the constraint that $M'(A') = B'$. (This knowledge is also embedded in
internal coherence.)

$$A = \{a_1, \ldots, a_m\} \tag{55}$$
$$B = \{b_1, \ldots, b_m\} \tag{56}$$
$$P'(A, B) = \{M \mid M \in P(A, B) \text{ and } M(A') = B'\} \tag{57}$$
$$M' = \operatorname*{argmax}_{M \in P'(A,B)} \sum_{i=1}^{m} \sum_{j=i+1}^{m} \mathrm{sim}_r(a_i\!:\!a_j,\; M(a_i)\!:\!M(a_j)) \tag{58}$$

In this case, $M'$ is chosen using both the relations that are internal to $(A', B')$ and
other relations in $(A, B)$ that are external to $(A', B')$.



Suppose that we calculate the accuracy of these two methods based only on the subproblem
$(A', B')$. At first it might seem that there is no advantage to total coherence,
because it must explore a larger space of possible mappings than internal coherence (since
$|P'(A,B)|$ is larger than $|P(A',B')|$), but the additional terms that it explores are not
involved in calculating the accuracy. However, we hypothesize that total coherence will
have a higher accuracy than internal coherence, because the additional external relations
help to select the correct mapping.

To test this hypothesis, we set $m'$ to 3 and we randomly generated ten new reduced
mapping problems for each of the twenty problems (i.e., a total of 200 new problems of size
3). The average accuracy of internal coherence was 93.3%, whereas the average accuracy
of total coherence was 97.3%. The difference is statistically significant (paired t-test, 95%
confidence level).
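
The difference between the two selection methods can be seen in a small example. The sketch below is illustrative only: the function names are hypothetical, the relation labels are made up (LRME itself compares corpus-derived vectors, not symbolic labels), and ties are returned rather than broken randomly so that the ambiguity is visible.

```python
from itertools import permutations

def relational_score(M, terms, sim_r):
    """Sum sim_r over all pairs i < j of `terms` (the double sum in Equations 54 and 58)."""
    return sum(sim_r((terms[i], terms[j]), (M[terms[i]], M[terms[j]]))
               for i in range(len(terms)) for j in range(i + 1, len(terms)))

def best_mappings(candidates, terms, sim_r):
    """Return every candidate that attains the maximum score (ties are kept, not broken)."""
    scored = [(relational_score(M, terms, sim_r), M) for M in candidates]
    top = max(score for score, _ in scored)
    return [M for score, M in scored if score == top]

def internal_coherence(A_sub, B_sub, sim_r):
    """Equation 54: search P(A', B') using only relations inside (A', B')."""
    candidates = [dict(zip(A_sub, perm)) for perm in permutations(B_sub)]
    return best_mappings(candidates, A_sub, sim_r)

def total_coherence(A, B, A_sub, B_sub, sim_r):
    """Equation 58: search P'(A, B), then keep the part of each winner that covers A'."""
    candidates = [dict(zip(A, perm)) for perm in permutations(B)]
    constrained = [M for M in candidates if {M[a] for a in A_sub} == set(B_sub)]
    restricted = []
    for M in best_mappings(constrained, A, sim_r):
        part = {a: M[a] for a in A_sub}
        if part not in restricted:
            restricted.append(part)
    return restricted

# Toy relation labels (made up for illustration only).
TOY_RELATION = {
    ("sun", "planet"): "attracts", ("nucleus", "electron"): "attracts",
    ("electron", "nucleus"): "attracts",
    ("sun", "solar system"): "center_of", ("nucleus", "atom"): "center_of",
    ("planet", "solar system"): "periphery_of", ("electron", "atom"): "periphery_of",
}

def toy_sim_r(pair1, pair2):
    return 1.0 if TOY_RELATION[pair1] == TOY_RELATION[pair2] else 0.0

A = ["sun", "planet", "solar system"]
B = ["nucleus", "electron", "atom"]
A_sub, B_sub = ["sun", "planet"], ["nucleus", "electron"]

internal = internal_coherence(A_sub, B_sub, toy_sim_r)
total = total_coherence(A, B, A_sub, B_sub, toy_sim_r)
print(len(internal), total)
```

Here the only internal relation is symmetric, so internal coherence cannot distinguish the two candidate mappings, whereas the relations to the external term single out one of them. The external term takes no part in the accuracy calculation, yet it still decides the mapping of the internal terms.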

On the other hand, the attributional mapping approaches cannot benefit from total 
coherence, because there is no connection between the attributes that are in (A', B') and 
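the attributes that are outside. We can decompose $\mathrm{score}_a(M)$ into two independent parts.

$$A'' = A \setminus A' \tag{59}$$
$$A = A' \cup A'' \tag{60}$$
$$P'(A, B) = \{M \mid M \in P(A, B) \text{ and } M(A') = B'\} \tag{61}$$
$$M' = \operatorname*{argmax}_{M \in P'(A,B)} \sum_{i=1}^{m} \mathrm{sim}_a(a_i, M(a_i)) \tag{62}$$
$$\phantom{M'} = \operatorname*{argmax}_{M \in P'(A,B)} \left( \sum_{a_i \in A'} \mathrm{sim}_a(a_i, M(a_i)) + \sum_{a_i \in A''} \mathrm{sim}_a(a_i, M(a_i)) \right) \tag{63}$$
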

These two parts can be optimized independently. Thus the terms that are external to 
(A',B') have no influence on the part of M' that covers (A',B'). 

Relational mapping cannot be decomposed into independent parts in this way, because 
the relations connect the parts. This gives relational mapping approaches an inherent 
advantage over attributional mapping approaches. 

To confirm this analysis, we compared internal and total coherence using LIN+POS 
on the same 200 new problems of size 3. The average accuracy of internal coherence was 
88.0%, whereas the average accuracy of total coherence was 87.0%. The difference is not 
statistically significant (paired t-test, 95% confidence level). (The only reason that there is 
any difference is that, when two mappings have the same score, we break the ties randomly. 
This causes random variation in the accuracy.) 

The benefit from coherence suggests that we can make analogy mapping problems easier 
for LRME by adding more terms. The difficulty is that the new terms cannot be randomly 
chosen; they must fit with the logic of the analogy and not overlap with the existing terms. 

Of course, this is not the only important difference between the relational and attribu- 
tional mapping approaches. We believe that the most important difference is that relations 
are more reliable and more general than attributes, when using past experiences to make 
predictions about the future (Hofstadter, 2001; Gentner, 2003). Unfortunately, this hypoth- 
esis is more difficult to evaluate experimentally than our hypothesis about coherence. 

10. Related Work 

French (2002) gives a good survey of computational approaches to analogy-making, from the 
perspective of cognitive science (where the emphasis is on how well computational systems 
model human performance, rather than how well the systems perform). We will sample a 
few systems from his survey and add a few more that were not mentioned. 

French (2002) categorizes analogy-making systems as symbolic, connectionist, or symbolic-connectionist
hybrids. Gärdenfors (2004) proposes another category of representational
systems for AI and cognitive science, which he calls conceptual spaces. These spatial or geo- 
metric systems are common in information retrieval and machine learning (Widdows, 2004; 
van Rijsbergen, 2004). An influential example is Latent Semantic Analysis (Landauer & 
Dumais, 1997). The first spatial approaches to analogy-making began to appear around the
same time as French's (2002) survey. LRME takes a spatial approach to analogy-making.

10.1 Symbolic Approaches

Computational approaches to analogy-making date back to ANALOGY (Evans, 1964) and 
Argus (Reitman, 1965). Both of these systems were designed to solve proportional analogies 
(analogies in which $|A| = |B| = 2$; see Section 4). ANALOGY could solve proportional
analogies with simple geometric figures and Argus could solve simple word analogies. These 
systems used hand-coded rules and were only able to solve the limited range of problems 
that their designers had anticipated and coded in the rules. 

French (2002) cites Structure Mapping Theory (SMT) (Gentner, 1983) and the Structure 
Mapping Engine (SME) (Falkenhainer et al., 1989) as the prime examples of symbolic 
approaches: 

SMT is unquestionably the most influential work to date on the modeling of 
analogy-making and has been applied in a wide range of contexts ranging from 
child development to folk physics. SMT explicitly shifts the emphasis in analogy- 
making to the structural similarity between the source and target domains. Two 
major principles underlie SMT: 

• the relation-matching principle: good analogies are determined by map- 
pings of relations and not attributes (originally only identical predicates 
were mapped) and 

• the systematicity principle: mappings of coherent systems of relations are 
preferred over mappings of individual relations. 

This structural approach was intended to produce a domain-independent map- 
ping process. 

LRME follows both of these principles. LRME uses only relational similarity; no attribu- 
tional similarity is involved (see Section 7.1). Coherent systems of relations are preferred 
over mappings of individual relations (see Section 9.3). However, the spatial (statistical, 
corpus-based) approach of LRME is quite different from the symbolic (logical, hand-coded) 
approach of SME. 

Martin (1992) uses a symbolic approach to handle conventional metaphors. Gentner, 
Bowdle, Wolff, and Boronat (2001) argue that novel metaphors are processed as analogies, 
but conventional metaphors are recalled from memory without special processing. However, 
the line between conventional and novel metaphor can be unclear. 

Dolan (1995) describes an algorithm that can extract conventional metaphors from a 
dictionary. A semantic parser is used to extract semantic relations from the Longman 
Dictionary of Contemporary English (LDOCE). A symbolic algorithm finds metaphorical 
relations between words, using the extracted relations. 

Veale (2003, 2004) has developed a symbolic approach to analogy-making, using Word- 
Net as a lexical resource. Using a spreading activation algorithm, he achieved a score of 
43.0% on a set of 374 multiple-choice lexical proportional analogy questions from the SAT 
college entrance test (Veale, 2004). 

Lepage (1998) has demonstrated that a symbolic approach to proportional analogies can
be used for morphology processing. Lepage and Denoual (2005) apply a similar approach
to machine translation.

10.2 Connectionist Approaches

Connectionist approaches to analogy-making include ACME (Holyoak & Thagard, 1989) 
and LISA (Hummel & Holyoak, 1997). Like symbolic approaches, these systems use hand- 
coded knowledge representations, but the search for mappings takes a connectionist ap- 
proach, in which there are nodes with weights that are incrementally updated over time, 
until the system reaches a stable state. 

10.3 Symbolic-Connectionist Hybrid Approaches 

The third family examined by French (2002) is hybrid approaches, containing elements 
of both the symbolic and connectionist approaches. Examples include Copycat (Mitchell, 
1993) and Tabletop (French, 1995). Much of the work in the Fluid Analogies Research 
Group (FARG) concerns symbolic-connectionist hybrids (Hofstadter & FARG, 1995). 

10.4 Spatial Approaches 

Marx, Dagan, Buhmann, and Shamir (2002) present the coupled clustering algorithm, which 
uses a feature vector representation to find analogies in collections of text. For example, 
given documents on Buddhism and Christianity, it finds related terms, such as {school, 
Mahayana, Zen} for Buddhism and {tradition, Catholic, Protestant} for Christianity.

Mason (2004) describes the CorMet system for extracting conventional metaphors from 
text. CorMet is based on clustering feature vectors that represent the selectional preferences 
of verbs. Given keywords for the source domain laboratory and the target domain finance, 
it is able to discover mappings such as liquid → income and container → institution.

Turney, Littman, Bigham, and Shnayder (2003) present a system for solving lexical 
proportional analogy questions from the SAT college entrance test, which combines thirteen 
different modules. Twelve of the modules use either attributional similarity or a symbolic 
approach to relational similarity, but one module uses a spatial (feature vector) approach 
to measuring relational similarity. This module worked much better than any of the other 
modules; therefore, it was studied in more detail by Turney and Littman (2005). The 
relation between a pair of words is represented by a vector, in which the elements are pattern 
frequencies. This is similar to LRME, but one important difference is that Turney and 
Littman (2005) used a fixed, hand-coded set of 128 patterns, whereas LRME automatically 
generates a variable number of patterns from the given corpus (33,240 patterns in our 
experiments here). 

Turney (2005) introduced Latent Relational Analysis (LRA), which was examined more 
thoroughly by Turney (2006). LRA achieves human-level performance on a set of 374 
multiple-choice proportional analogy questions from the SAT college entrance exam. LRME 
uses a simplified form of LRA. A similar simplification of LRA is used by Turney (2008), in 
a system for processing analogies, synonyms, antonyms, and associations. The contribution 
of LRME is to go beyond proportional analogies, to larger systems of analogical mappings. 

10.5 General Theories of Analogy and Metaphor 

Many theories of analogy-making and metaphor either do not involve computation or they
suggest general principles and concepts that are not specific to any particular computational
approach. The design of LRME has been influenced by several theories of this type (Gentner,
1983; Hofstadter & FARG, 1995; Holyoak & Thagard, 1995; Hofstadter, 2001; Gentner,
2003).

Lakoff and Johnson (1980) provide extensive evidence that metaphor is ubiquitous in 
language and thought. We believe that a system for analogy-making should be able to 
handle metaphorical language, which is why ten of our analogy problems are derived from 
Lakoff and Johnson (1980). We agree with their claim that a metaphor does not merely 
involve a superficial relation between a couple of words; rather, it involves a systematic set 
of mappings between two domains. Thus our analogy problems involve larger sets of words, 
beyond proportional analogies. 

Holyoak and Thagard (1995) argue that analogy-making is central in our daily thought, 
and especially in finding creative solutions to new problems. Our ten scientific analogies 
were derived from their examples of analogy-making in scientific creativity. 

11. Limitations and Future Work 

In Section 4, we mentioned ten applications for LRA, and in Section 5 we claimed that the 
results of the experiments in Section 9.3 suggest that LRME may perform better than LRA 
on all ten of these applications, due to its ability to handle bijective analogies when m > 2. 
Our focus in future work will be testing this hypothesis. In particular, the task of semantic 
role labeling, discussed in Section 1, seems to be a good candidate application for LRME. 

The input to LRME is simpler than the input to SME (compare Figures 1 and 2 in 
Section 1 with Table 1), but there is still some human effort involved in creating the input. 
LRME is not immune to the criticism of Chalmers, French, and Hofstadter (1992), that 
the human who generates the input is doing more work than the computer that makes the 
mappings, although it is not a trivial matter to find the right mapping out of 5,040 (7!) 
choices. 

In future work, we would like to relax the requirement that (A, B) must be a bijection 
(see Section 3), by adding irrelevant words (distractors) and synonyms. The mapping 
algorithm will be forced to decide what terms to include in the mapping and what terms 
to leave out. 

We would also like to develop an algorithm that can take a proportional analogy (m = 2) 
as input (e.g., sun:planet::nucleus:electron) and automatically expand it to a larger analogy 
(m > 2, e.g., Table 2). That is, it would automatically search the corpus for new terms to 
add to the analogy. 

The next step would be to give the computer only the topic of the source domain (e.g., 
solar system) and the topic of the target domain (e.g., atomic structure), and let it work 
out the rest on its own. This might be possible by combining ideas from LRME with ideas 
from coupled clustering (Marx et al., 2002) and CorMet (Mason, 2004). 

It seems that analogy-making is triggered in people when we encounter a problem
(Holyoak & Thagard, 1995). The problem defines the target for us, and we immediately
start searching for a source. Analogical mapping enables us to transfer our knowledge of
the source to the target, hopefully leading to a solution to the problem. This suggests that
the input to the ideal analogical mapping algorithm would be simply a statement that there
is a problem (e.g., What is the structure of the atom?). Ultimately, the computer might
find the problems on its own as well. The only input would be a large corpus.

The algorithms we have considered here all perform exhaustive search of the set of 
possible mappings P(A,B). This is acceptable when the sets are small, as they are here, 
but it will be problematic for larger problems. In future work, it will be necessary to use 
heuristic search algorithms instead of exhaustive search. 
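
To make the scaling concern concrete, the size of the search space can be tabulated for the problem sizes used here. This is a small illustrative sketch with hypothetical helper names. It counts the bijections in P(A, B) and, following the double sum in Equation 39, the number of pairwise relational comparisons needed to score one mapping.

```python
import math

def num_mappings(m):
    """Size of P(A, B): the number of bijections between two sets of m terms."""
    return math.factorial(m)

def num_pair_comparisons(m):
    """Number of (i, j) pairs with i < j in the relational score of one mapping."""
    return m * (m - 1) // 2

for m in range(5, 10):  # the problem sizes that appear in Table 11
    total = num_mappings(m) * num_pair_comparisons(m)
    print(f"m = {m}: {num_mappings(m)} mappings, {total} pair comparisons")
```

The factorial growth is the reason exhaustive search stops being practical once the term lists become much longer than those in the twenty problems.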

It takes almost 18 hours for LRME to process the twenty mapping problems (Section 7). 
With better hardware and some changes to the software, this time could be significantly 
reduced. For even greater speed, the algorithm could run continuously, building a large 
database of vector representations of term pairs, so that it is ready to create mappings as 
soon as a user requests them. This is similar to the vision of Banko and Etzioni (2007). 

LRME, like LRA and LSA (Landauer & Dumais, 1997), uses a truncated singular value 
decomposition (SVD) to smooth the matrix. Many other algorithms have been proposed 
for smoothing matrices. In our past work with LRA (Turney, 2006), we experimented with 
Nonnegative Matrix Factorization (NMF) (Lee & Seung, 1999), Probabilistic Latent Semantic
Analysis (PLSA) (Hofmann, 1999), Iterative Scaling (IS) (Ando, 2000), and Kernel
Principal Components Analysis (KPCA) (Schölkopf, Smola, & Müller, 1997). We had some
interesting results with small matrices (around 1000 × 2000), but none of the algorithms
seemed substantially better than truncated SVD, and none of them scaled up to the matrix
sizes that we have here (1,662 × 33,240). However, we believe that SVD is not unique, and
future work is likely to discover a smoothing algorithm that is more efficient and effective 
than SVD. The results in Section 7.2 do not show a significant benefit from SVD. Table 8 
hints that PPMIC (Bullinaria & Levy, 2007) is more important than SVD. 

LRME extracts knowledge from many fragments of text. In Section 7.1, we noted 
that we found an average of 1,180 phrases per pair. The information from these 1,180 
phrases is combined in a vector, to represent the semantic relation for a pair. This is 
quite different from relation extraction in (for example) the Automatic Content Extraction 
(ACE) Evaluation. 11 The task in ACE is to identify and label a semantic relation in a single 
sentence. Semantic role labeling also involves labeling a single sentence (Gildea & Jurafsky, 
2002). 

The contrast between LRME and ACE is analogous to the distinction in cognitive 
psychology between semantic and episodic memory. Episodic memory is memory of a 
specific event in one's personal past, whereas semantic memory is memory of basic facts and 
concepts, unrelated to any specific event in the past. LRME extracts relational information 
that is independent of any specific sentence, like semantic memory. ACE is concerned with 
extracting the relation in a specific sentence, like episodic memory. In cognition, episodic 
memory and semantic memory work together synergistically. When we experience an event, 
we use our semantic memory to interpret the event and form a new episodic memory, 
but semantic memory is itself constructed from our past experiences, our accumulated 
episodic memories. This suggests that there should be a synergy from combining LRME-like
semantic information extraction algorithms with ACE-like episodic information extraction 
algorithms. 



11. ACE is an annual event that began in 1999. Relation Detection and Characterization (RDC) was 
introduced to ACE in 2001. For more information, see http://www.nist.gov/speech/tests/ace/. 






TURNEY 



12. Conclusion 

Analogy is the core of cognition. We understand the present by analogy to the past. We 
predict the future by analogy to the past and the present. We solve problems by searching 
for analogous situations (Holyoak & Thagard, 1995). Our daily language is saturated with 
metaphor (Lakoff & Johnson, 1980), and metaphor is based on analogy (Gentner et al., 
2001). To understand human language, to solve human problems, to work with humans, 
computers must be able to make analogical mappings. 

Our best theory of analogy-making is Structure Mapping Theory (Gentner, 1983), but 
the Structure Mapping Engine (Falkenhainer et al., 1989) puts too much of the burden 
of analogy-making on its human users (Chalmers et al., 1992). LRME is an attempt to 
shift some of that burden onto the computer, while remaining consistent with the general 
principles of SMT. 

We have shown that LRME is able to solve bijective analogical mapping problems with 
human-level performance. Attributional mapping algorithms (at least, those we have tried 
so far) are not able to reach this level. This supports SMT, which claims that relations are 
more important than attributes when making analogical mappings. 

There is still much research to be done. LRME takes some of the load off the human
user, but formulating the input to LRME is not easy. This paper is an incremental step
towards a future in which computers can make surprising and useful analogies with minimal
human assistance.

Appendix A. Details of the Mapping Problems

In this appendix, we provide detailed information about the twenty mapping problems. 
Figure 3 shows the instructions that were given to the participants in the experiment in 
Section 6. These instructions were displayed in their web browsers. Tables 14, 15, 16, 
and 17 show the twenty mapping problems. The first column gives the problem number 
(e.g., A1) and a mnemonic that summarizes the mapping (e.g., solar system → atom). The
second column gives the source terms and the third column gives the target terms. 

The mappings shown in these tables are our intended mappings. The fourth column
shows the percentage of participants who agreed with our intended mappings. For example,

Systematic Analogies and Metaphors
Instructions 

You will be presented with twenty analogical mapping problems, ten based on scientific 
analogies and ten based on common metaphors. A typical problem will look like this: 



    horse    [ ? ▼ ]
    legs     [ ? ▼ ]
    hay      [ ? ▼ ]
    brain    [ ? ▼ ]
    dung     [ ? ▼ ]



You may click on the drop-down menus above, to see what options are available. 

Your task is to construct an analogical mapping; that is, a one-to-one mapping between the 
items on the left and the items on the right. For example: 



    horse  →  [ car      ▼ ]
    legs   →  [ wheels   ▼ ]
    hay    →  [ gasoline ▼ ]
    brain  →  [ driver   ▼ ]
    dung   →  [ exhaust  ▼ ]



This mapping expresses an analogy between a horse and a car. The horse's legs are like the 
car's wheels. The horse eats hay and the car consumes gasoline. The horse's brain controls 
the movement of the horse like the car's driver controls the movement of the car. The horse 
generates dung as a waste product like the car generates exhaust as a waste product. 

You should have no duplicate items in your answers on the right-hand side. If there are 
any duplicates or missing items (question marks), you will get an error message when you 
submit your answer. 

You are welcome to use a dictionary as you work on the problems, if you would find it 
helpful. 

If you find the above instructions unclear, then please do not continue with this exercise. 
Your answers to the twenty problems will be used as a standard for evaluating the output 
of a computer algorithm; therefore, you should only proceed if you are confident that you 
understand this task. 

Figure 3: The instructions for the participants in the experiment in Section 6. 
Mapping
    Source              →  Target               Agreement   POS

A1: solar system → atom
    solar system        →  atom                      86.4   NN
    sun                 →  nucleus                  100.0   NN
    planet              →  electron                  95.5   NN
    mass                →  charge                    86.4   NN
    attracts            →  attracts                  90.9   VBZ
    revolves            →  revolves                  95.5   VBZ
    gravity             →  electromagnetism          81.8   NN
    Average agreement:                               90.9

A2: water flow → heat transfer
    water               →  heat                      86.4   NN
    flows               →  transfers                 95.5   VBZ
    pressure            →  temperature               86.4   NN
    water tower         →  burner                    72.7   NN
    bucket              →  kettle                    72.7   NN
    filling             →  heating                   95.5   VBG
    emptying            →  cooling                   95.5   VBG
    hydrodynamics       →  thermodynamics            90.9   NN
    Average agreement:                               86.9

A3: waves → sounds
    waves               →  sounds                    86.4   NNS
    shore               →  wall                      77.3   NN
    reflects            →  echoes                    95.5   VBZ
    water               →  air                       95.5   NN
    breakwater          →  insulation                81.8   NN
    rough               →  loud                      63.6   JJ
    calm                →  quiet                    100.0   JJ
    crashing            →  vibrating                 54.5   VBG
    Average agreement:                               81.8

A4: combustion → respiration
    combustion          →  respiration               72.7   NN
    fire                →  animal                    95.5   NN
    fuel                →  food                      90.9   NN
    burning             →  breathing                 72.7   VBG
    hot                 →  living                    59.1   JJ
    intense             →  vigorous                  77.3   JJ
    oxygen              →  oxygen                    77.3   NN
    carbon dioxide      →  carbon dioxide            86.4   NN
    Average agreement:                               79.0

A5: sound → light
    sound               →  light                     86.4   NN
    low                 →  red                       50.0   JJ
    high                →  violet                    54.5   JJ
    echoes              →  reflects                 100.0   VBZ
    loud                →  bright                    90.9   JJ
    quiet               →  dim                       77.3   JJ
    horn                →  lens                      95.5   NN
    Average agreement:                               79.2

Table 14: Science analogy problems A1 to A5, derived from Chapter 8 of Holyoak and
Thagard (1995).
Mapping
    Source              →  Target               Agreement   POS

A6: projectile → planet
    projectile          →  planet                   100.0   NN
    trajectory          →  orbit                    100.0   NN
    earth               →  sun                      100.0   NN
    parabolic           →  elliptical               100.0   JJ
    air                 →  space                    100.0   NN
    gravity             →  gravity                   90.9   NN
    attracts            →  attracts                  90.9   VBZ
    Average agreement:                               97.4

A7: artificial selection → natural selection
    breeds              →  species                  100.0   NNS
    selection           →  competition               59.1   NN
    conformance         →  adaptation                59.1   NN
    artificial          →  natural                   77.3   JJ
    popularity          →  fitness                   54.5   NN
    breeding            →  mating                    95.5   VBG
    domesticated        →  wild                      77.3   JJ
    Average agreement:                               74.7

A8: billiard balls → gas molecules
    balls               →  molecules                 90.9   NNS
    billiards           →  gas                       72.7   NN
    speed               →  temperature               81.8   NN
    table               →  container                 95.5   NN
    bouncing            →  pressing                  77.3   VBG
    moving              →  moving                    86.4   VBG
    slow                →  cold                     100.0   JJ
    fast                →  hot                      100.0   JJ
    Average agreement:                               88.1

A9: computer → mind
    computer            →  mind                      90.9   NN
    processing          →  thinking                  95.5   VBG
    erasing             →  forgetting               100.0   VBG
    write               →  memorize                  72.7   VB
    read                →  remember                  54.5   VB
    memory              →  memory                    81.8   NN
    outputs             →  muscles                   72.7   NNS
    inputs              →  senses                    90.9   NNS
    bug                 →  mistake                  100.0   NN
    Average agreement:                               84.3

A10: slot machine → bacterial mutation
    slot machines       →  bacteria                  68.2   NNS
    reels               →  genes                     72.7   NNS
    spinning            →  mutating                  86.4   VBG
    winning             →  reproducing               90.9   VBG
    losing              →  dying                    100.0   VBG
    Average agreement:                               83.6

Table 15: Science analogy problems A6 to A10, derived from Chapter 8 of Holyoak and
Thagard (1995).
Mapping  Source         → Target          Agreement  POS

M1       war → argument
         war            → argument             90.9  NN
         soldier        → debater             100.0  NN
         destroy        → refute               90.9  VB
         fighting       → arguing              95.5  VBG
         defeat         → acceptance           90.9  NN
         attacks        → criticizes           95.5  VBZ
         weapon         → logic                90.9  NN
         Average agreement:                    93.5

M2       buying an item → accepting a belief
         buyer          → believer            100.0  NN
         merchandise    → belief               90.9  NN
         buying         → accepting            95.5  VBG
         selling        → advocating          100.0  VBG
         returning      → rejecting            95.5  VBG
         valuable       → true                 95.5  JJ
         worthless      → false                95.5  JJ
         Average agreement:                    96.1

M3       grounds for a building → reasons for a theory
         foundations    → reasons              72.7  NNS
         buildings      → theories             77.3  NNS
         supporting     → confirming           95.5  VBG
         solid          → rational             90.9  JJ
         weak           → dubious              95.5  JJ
         crack          → flaw                 95.5  NN
         Average agreement:                    87.9

M4       impediments to travel → difficulties
         obstructions   → difficulties        100.0  NNS
         destination    → goal                100.0  NN
         route          → plan                100.0  NN
         traveller      → person              100.0  NN
         travelling     → problem solving     100.0  VBG
         companion      → partner             100.0  NN
         arriving       → succeeding          100.0  VBG
         Average agreement:                   100.0

M5       money → time
         money          → time                 95.5  NN
         allocate       → invest               86.4  VB
         budget         → schedule             86.4  NN
         effective      → efficient            86.4  JJ
         cheap          → quick                50.0  JJ
         expensive      → slow                 59.1  JJ
         Average agreement:                    77.3

Table 16: Common metaphor problems M1 to M5, derived from Lakoff and Johnson (1980).

Mapping  Source         → Target          Agreement  POS

M6       seeds → ideas
         seeds          → ideas                90.9  NNS
         planted        → inspired             95.5  VBD
         fruitful       → productive           81.8  JJ
         fruit          → product       [illegible]  NN
         grow           → develop       [illegible]  VB
         wither         → fail                100.0  VB
         blossom        → succeed              77.3  VB
         Average agreement:                    89.0

M7       machine → mind
         machine        → mind          [illegible]  NN
         working        → thinking            100.0  VBG
         turned on      → awake               100.0  JJ
         turned off     → asleep              100.0  JJ
         broken         → confused            100.0  JJ
         power          → intelligence  [illegible]  NN
         repair         → therapy             100.0  NN
         Average agreement:             [illegible]

M8       object → idea
         object         → idea                 90.9  NN
         hold           → understand           81.8  VB
         weigh          → analyze              81.8  VB
         heavy          → important            95.5  JJ
         light          → trivial              95.5  JJ
         Average agreement:                    89.1

M9       following → understanding
         follow         → understand          100.0  VB
         leader         → speaker             100.0  NN
         path           → argument            100.0  NN
         follower       → listener            100.0  NN
         lost           → misunderstood        86.4  JJ
         wanders        → digresses            90.9  VBZ
         twisted        → complicated          95.5  JJ
         straight       → simple              100.0  JJ
         Average agreement:                    96.6

M10      seeing → understanding
         seeing         → understanding        68.2  VBG
         light          → knowledge            77.3  NN
         illuminating   → explaining           86.4  VBG
         darkness       → confusion            86.4  NN
         view           → interpretation       68.2  NN
         hidden         → secret               86.4  JJ
         Average agreement:                    78.8

Table 17: Common metaphor problems M6 to M10, derived from Lakoff and Johnson
(1980).

in problem A1, 81.8% of the participants (18 out of 22) mapped gravity to electromagnetism.
The final column gives the part-of-speech (POS) tags for the source and target terms. We 
used the Penn Treebank tags (Santorini, 1990). We assigned these tags manually. Our 
intended mappings and our tags were chosen so that mapped terms have the same tags. 
For example, in A1, sun maps to nucleus, and both sun and nucleus are tagged NN. The
POS tags are used in the experiments in Section 8. The POS tags are not used by LRME 
and they were not shown to the participants in the experiment in Section 6. 
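
The agreement figures in the tables can be reproduced with simple arithmetic. The following is only an illustrative sketch, not code from the experiments: the function name `agreement` is hypothetical, and averaging the rounded per-row percentages is an assumption (the reported averages may have been computed from unrounded values).

```python
from statistics import mean


def agreement(n_agree: int, n_participants: int) -> float:
    """Percentage of participants whose choice matches the intended mapping."""
    return round(100.0 * n_agree / n_participants, 1)


# Example from the text: 18 of 22 participants mapped gravity to electromagnetism.
a1_gravity = agreement(18, 22)

# Per-row agreement percentages for problem A10, as listed in Table 15.
a10_rows = [68.2, 72.7, 86.4, 90.9, 100.0]
a10_average = round(mean(a10_rows), 1)

print(a1_gravity)   # 81.8
print(a10_average)  # 83.6
```

With 22 participants, every per-row agreement is a multiple of 100/22 (about 4.5 percentage points), which is why values such as 90.9, 95.5, and 100.0 recur throughout the tables.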